ABSTRACT


A special purpose multiple processor system, the shared
I/O, is discussed. It is a method of organizing multiple
processors such that all processors perform identical
operations. Due to the nature of the structure, it is
important that the communication between the processors be
managed in as short a time span as possible. Two designs
that supervise the inter-processor communications in such
an environment are presented. The concept of shared I/O is
also compared with other schemes used for similar
purposes, in particular, pipelining.
CHAPTER 1


Introduction


1.1. An Overview


In the recent past, many computer systems have been
designed specifically to implement a single algorithm or a
group of similar algorithms. These special purpose
hardware units are used for algorithms that require an
extensive amount of time to execute on a conventional
computer. Examples include systems developed for computer
assisted tomography [BROO76, SWAR82], analyzing data from
satellites [ONOE81] and real time processing of signals
and images [RABI75].


This thesis presents a design of yet another special
purpose system, referred to as the shared I/O
organization, developed by Heuft [HEUF80]. Shared I/O is a
method of organizing multiple processors such that all
processors perform identical operations. This
characteristic makes the shared I/O well suited for
implementation using commercially available LSI components
such as microprocessors. Due to the nature of the structure
(to be discussed later), it is of paramount importance that
the communication between various modules within the
system be managed in as short a time span as possible. Two
designs that supervise the inter-processor communications
in such an environment are described. The concept of
shared I/O is also compared with other schemes used for
similar purposes, such as pipelining.


The next section relates shared I/O to the other
multiple-processor systems available today. It is
followed by a discussion of the shared I/O. The subsequent
chapter compares shared I/O with pipelining and systolic
arrays. Systolic arrays are highly regular cellular
structures that work on the principle of pipelining yet have
a close resemblance to shared I/O. It is followed by a
description of the two methods of managing inter-processor
communications. Conclusions and recommendations complete
the thesis.


1.2. Shared I/O and other multiple-processor systems 


Since the time of Unger [UNGE58], there has been a
significant increase in the number of computer systems
which attempt to improve the performance to cost ratio.
Most of them make use of multiple processors cooperating
with each other to achieve the desired performance levels.
Since the shared I/O scheme also uses multiple processors,
there is a need to present a clear idea of what shared I/O
is in relation to other systems.


There are two approaches: one is based on a particular
architectural feature, the memory, and the other on
the degree of cooperation among the processors in the
system. We discuss these in the following sections.
Figure 1.1 illustrates the discussion to follow in a
graphical manner. For completeness, a brief description
of each node in the graph is given, which includes a few
representative examples. Before we go into any detail, we
first briefly describe the shared I/O organization.


1.2.1. Shared I/O in a nutshell


The shared I/O organization is a special purpose
system designed to implement a certain class of algorithms.
It consists of multiple processors that cooperate on a
computation. Figure 1.2 shows a set of processors
connected in the shared I/O configuration. Each processor
has its own memory that is not accessible to any other
processor in the system. A simple interconnection scheme
connects adjacent processors together to form the
communication link. The first and last processors are
considered adjacent. In shared I/O, processor(i) sends
values to processor(i+1) and receives from processor(i-1).
There are only two I/O devices for the entire system and
they are shared by all the processors (hence the name
shared I/O). The major envisaged use of the shared I/O is
its function as a peripheral to a main computer.
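
The neighbour relation described above can be made concrete with a small sketch. It is only an illustration: the function name ring_neighbours and the 0-based processor numbering are assumptions introduced here, not part of the original design.

```python
def ring_neighbours(i: int, n: int) -> tuple[int, int]:
    """Return (source, destination) for processor i in a ring of n processors.

    Processor i receives from processor (i - 1) and sends to processor
    (i + 1); the first and last processors are treated as adjacent.
    Indices are 0-based here, which is an assumption of this sketch.
    """
    if n < 1 or not 0 <= i < n:
        raise ValueError("processor index out of range")
    source = (i - 1) % n       # neighbour this processor receives from
    destination = (i + 1) % n  # neighbour this processor sends to
    return source, destination


# One-directional links of a four-processor configuration.
links = [(i, ring_neighbours(i, 4)[1]) for i in range(4)]
```

For four processors the links are 0 to 1, 1 to 2, 2 to 3 and 3 back to 0, so communication is in one direction only and between adjacent processors.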


1.2.2. The aspect of memory 


The processors in a multiple processor system can
either share a common memory or operate from within their
own private memories. The multiprocessor as defined by
Enslow [ENSL77] is a shared memory system. It consists of
two or more processors sharing a main memory and I/O
devices under the supervision of a single operating system.
It also has the capability to allow interaction among the
processors at both software and hardware levels. C.mmp
[WULF72] and Cm* [SWAN77] are two well known models of
such a multiprocessor. The shared I/O, though it allows
the sharing of I/O devices, does not permit the sharing of
memory.


When the memories attached to the processors are
separate from each other, we can have several distinct
multiple processor systems. There are distributed systems
and computer networks.


Enslow defines a distributed system to be
characterized by five components [ENSL78]:


(1) it has a number of processors which can be
    dynamically assigned to specific tasks;


(2) the processors are physically distributed and
    interact with each other over a communication
    network;


(3) the distribution of the processors and the
    computations they perform is transparent to the user.
    That is, a user is unaware of the distribution of the
    system and thus is unable to specify the processor he
    would like to use for a particular computation;


(4) the processors cooperate with each other on a
    computation though they remain autonomous with respect
    to each other. That is, they retain the right to refuse
    a service request from other processors;


(5) and finally, each processor has its own local
    operating system; there is also a common operating
    system that supervises and integrates the operations of
    the entire system.


HXDP, developed at Honeywell, is an example of an
experimental distributed system [JENS78].


Shared I/O conforms to the above definition with
respect to items 2 and 4. The processors in the shared I/O
are autonomous, because they may not grant their neighbors
permission to communicate with them until they are ready
to do so. For instance, processor P(i) may not process a
send from P(i-1) until it is ready to receive the data. As
for item 1, tasks are not dynamically assigned to the
processors in the shared I/O during a computation; they are
determined at the beginning of the execution. Similarly,


item 3 is not applicable because the processors in the
shared I/O perform identical tasks. Thus, the shared I/O
is not a distributed system.


Computer networks are similar to distributed systems
except that there is no system transparency. The user
must be knowledgeable about the resources available
elsewhere and specify explicitly how to access them. Also,
there is no high level operating system that controls and
integrates the underlying hardware. To state it
otherwise, computer networks are designed for resource
sharing rather than sharing of the computational load.
Shared I/O is not a computer network because its main
objective in introducing cooperation among the processors
is to share the computational load.


Other multiple processor systems we have not
considered so far are systems that exhibit a master slave
relationship. These are neither multiprocessors [ENSL74]
nor distributed systems nor computer networks [ENSL78].
Shared I/O could be classified as a member of this
category in that it operates as a slave to a master.
Within the slave there could be multiple processors
cooperating on a task. For example, the pipelined central
processor in the TI-ASC operates as a slave to the
peripheral processor, which executes the operating system
[WATS72]. The peripheral processor analyzes a job and
initiates the central processor on the execution of the
job after furnishing it with the required data. In a
similar fashion, pipelined array processors, such as the
Floating Point Systems AP family, are designed to be
peripheral devices that act on a command from the master.
[BERN82] gives an example of how a matrix multiplication
operation is carried out on the FPS AP. The master,
referred to as the host, transfers all the data necessary
for the multiplication to the array processor and receives
the results from the latter. The shared I/O can be
programmed to do operations similar to those of the FPS AP
family. The difference between the two is the method of
computation employed. Shared I/O exploits parallelism in
the given operation while the FPS AP uses pipelining.


1.2.3. The aspect of cooperation 


When there are multiple processors in a system they
usually cooperate with each other to share the
computational workload. There are three levels of
cooperation a multiple processor system may exhibit: job,
task or instruction. At the job level, different processors
are initiated with separate jobs independent of each other.
Traditional multiprogramming as done on a multiprocessor
is an example. The experimental version of the Michigan
Terminal System (MTS) is a multiprogramming system built
around two IBM 360/65MP processors [SROD78]. At any given
time, the processors are executing different jobs
originating from separate users. The cooperation between
the processors is in the form of observing the etiquette of
sharing common resources, such as memory or an I/O device.
Thus communication needs are at a minimum. The PRIME
computer [BASK72] is an example of a private memory
multiple processor system in which the processors cooperate
at the job level.


At the task level of cooperation, a job is
partitioned into several tasks that are distributed among
the processors. Since all the tasks must cooperate with
each other, it is necessary for some communication
mechanisms to be established between the different tasks so
that the values produced by one may be transferred to
another. The communication needs at this level of
cooperation tend to be moderate because the job is
partitioned in a way that would minimize the communication
overhead. C.mmp is an example of such a multiprocessor,
consisting of up to sixteen processors with task level
cooperation [WULF72]. Cm* is another instance of task level
cooperation in a multiprocessor [SATY80]. All useful
computation in Cm* is done through task forces, which are
activities that may execute in parallel. HXDP [JENS78] is
an example of a private memory system that exhibits
cooperation at the task level.


At the instruction level of cooperation, independent
instructions are executed in parallel on different
processors. There are three methods of organizing the
instruction level cooperation: parallel, pipelined or
overlapped operation. Parallel execution refers to
performing independent operations simultaneously on
different processors. In a pipeline, the execution function
is subdivided such that the intermediate results move from
one stage to another for application of another
subfunction. For example, the floating point addition on
the TI-ASC [WATS72] is a pipelined function. Overlapped
processing is different from pipelining in that the
evaluation of the function may require a different sequence
of subfunctions depending upon the dynamic state of the
system [KOGG81]. The instruction overlapping in the IBM 360
model 91 [ANDE67] is an excellent example. Pipelined vector
processors such as the TI-ASC and CDC STAR utilize the
parallelism among identical operations that have to be
performed on large vectors of data [RAMA77]. STARAN and
PEPE are examples of parallel execution of independent
instructions in which the operands are broadcast to a set
of similar processors [RUDO72, EVEN73]. MPP and ASPRO are
improvements over STARAN, developed at Goodyear Aerospace,
that make use of advanced technology for increased speed of
operation [BATC80, BATC82]. The concept of data flow is yet
another example which employs cooperation at the
instruction level. In a data flow computer, the execution
of each instruction depends upon some other instruction
which produces its inputs [TREA82]. Shared I/O is a
multiple processor system that cooperates at the
instruction level. In it, each processor may depend upon
values produced by its adjacent processor for its proper
operation. High speed communication facilities are required
at this level of cooperation since the intermediate results
have to be passed among the processors on a time scale
comparable to that of instruction execution time.


1.3. Summary 


To summarize, we have presented the shared I/O
organization in relation to the multiple processor
computer systems available today. It is a special purpose
system that cooperates at the instruction level. It has
no common memory and it uses shared I/O devices. It can
be used as a peripheral processor to a main computer. The
next chapter presents the shared I/O in greater detail.
It develops specific relations that characterize the
system.


Before we conclude this chapter, we would like to say
a few words about special purpose systems in general. We
have mentioned that shared I/O is a special purpose
system. By being special, it circumvents most of the
problems that are commonly related to general purpose
systems. For example, the interconnection scheme used for
shared I/O is very simple because its requirements are very
specialized. Communication is in one direction only and
between adjacent processors. In contrast, the
interconnection network required in a general purpose
multiple processor system can be very complex and hence
expensive [FENG81]. Thus, as Haynes, et al., point out
[HAYN82], "a very powerful way to optimize [a computer
system] is to tailor them to a specific problem or class of
problems". By carefully choosing the problems and
appropriate algorithms to solve them, it is possible to
design systems that exactly satisfy the requirements of the
problems. However, the disadvantage of such a design
methodology is that the range of applications that can be
programmed on the special system becomes too restricted.
Figure 1.1. Shared I/O in relation to other
            multiple processor systems


Figure 1.2. n processors in the shared I/O configuration


CHAPTER 2 


Shared I/O Organization


2.1. Introduction


The shared I/O organization is designed to implement
a certain class of algorithms. These algorithms have the
common characteristic that a large amount of time is spent
in executing a single main loop. Many signal processing
algorithms have such a characteristic. In the shared I/O
scheme, all the operations within the loop are assigned to
one processor. Each processor executes one repetition of
the loop and adjacent processors execute the successive
repetitions. Thus all the processors are executing
identical program code. As an example, consider the
following FORTRAN loop:

    1     DO 3 I = 1, K
    2     Y(I) = A*X(I) + B*Y(I-1) + C*Y(I-2)           2.1
    3     CONTINUE

Statement 2 constitutes the program for each of the
processors. Figure 2.1 shows one repetition of the loop as
programmed for the shared I/O organization. Each processor
accesses the input device to read an input value, X(I),
and applies the operators defined in the loop body to it.
This results in the generation of an output value, Y(I),
which is then sent to the output device. For the above
program, it can be seen that each processor requires two
previous values of Y, produced by processor(i-1) and
processor(i-2) respectively, to generate Y(I). In figure
2.1, Y(I-1) and Y(I-2) appearing on the left are the values
of Y produced by the repetitions (I-1) and (I-2),
respectively.

1. Here and in the rest of this thesis, for a variable Z,


Since the variable X is used for internal purposes
only, that is, it is not required by any other processor
but is considered local to the processor producing it, it
is referred to as a local variable and each of its values
as a local value. On the other hand, the Y values are used
in some other repetitions as well as repetition I and
consequently are referred to as non-local values. The
corresponding variable, Y, is referred to as the non-local
variable. Associated with each non-local variable there is
a non-local segment which, in repetition I, calculates the
non-local value, Y(I). For the local values, there are
corresponding local segments which produce them.
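
The split into local and non-local segments can be seen in a sequential reference version of loop 2.1. This is a minimal sketch written for illustration: the function name, the zero initial values and the Python rendering are assumptions, and the code says nothing about timing or about how the values travel between processors.

```python
def run_program_2_1(x, a, b, c, y_prev1=0.0, y_prev2=0.0):
    """Evaluate Y(I) = A*X(I) + B*Y(I-1) + C*Y(I-2) one repetition at a time.

    y_prev1 and y_prev2 stand for Y(0) and Y(-1); taking them as zero by
    default is an assumption of this sketch.
    """
    outputs = []
    for x_i in x:                 # one pass = one repetition on one processor
        y = a * x_i               # local segment: uses only the local value X(I)
        y = y + c * y_prev2       # non-local segment: needs Y(I-2)
        y = y + b * y_prev1       # non-local segment: needs Y(I-1)
        outputs.append(y)         # Y(I) goes to the output device
        y_prev2, y_prev1 = y_prev1, y  # Y(I) becomes non-local input for later repetitions
    return outputs
```

With x = [1, 0, 0], a = 1, b = 0.5 and c = 0.25 the call returns [1.0, 0.5, 0.5]; the second and third outputs exist only because of the non-local values handed on by earlier repetitions.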


2.2. Specifics of Shared I/O


The time lapse between the receipt of a non-local
value, say Y(I-1), and the transmission of a corresponding
value, Y(I), is denoted by Tns (see figure 2.1). It is
the delay from the time a non-local value is required from
a previous repetition to the time the corresponding value
is available to the next repetition. It is essentially the
time necessary to execute the non-local segment associated
with that variable.


Tns* is defined as the maximum of all such Tns:

    Tns* = max{Tns}                                     2.2


It represents the worst-case time delay between the time
when repetition I must receive a non-local value and the
time when repetition (I+1) is able to receive the
corresponding non-local value. Since the rest of the
program on all the processors is identical, Tns* is the
time delay between generation of successive output values.
Therefore, the maximum throughput that can be achieved for
a given algorithm, R*, expressed as the number of outputs
per unit time, is,

    R* = 1 / Tns*                                       2.3


Tns* is an important characteristic of an algorithm since
it is a measure of the extent to which successive
repetitions can be overlapped. Consecutive repetitions have
to commence at an interval, Ts, of at least Tns*, so that
no processor will require a non-local value before it is
produced (figure 2.2).
shared I/O, Tns* is zero (figure 2.3).

    Y(I) = A*X(I) + B*X(I-1) + C*X(I-2) + D*X(I-3)      2.4


A zero value for Tns* means that if all input values are 
available before execution begins, then all repetitions of 
the loop can begin executing in parallel. However, if suc- 
cessive input values are available only after a time delay 
of Ts as shown in figure 2.4, then the performance of 
these algorithms is limited by Ts rather than being infin- 


ite as defined by equation 2.3. 


Let Trep denote the time required to execute all the
operations associated with one repetition of the loop.
Then the number of processors needed to achieve maximum
throughput is,

    m* = ⌈Trep / Tns*⌉                                  2.5

where ⌈x⌉ is the smallest integer greater than or equal
to x. That is, if m* processors are available, then the
first processor, P(1), completes the execution of its
repetition at the same time as the processor adjacent to
P(m*) is expected to begin execution. Hence, P(1) can be
considered adjacent to P(m*) and begin the subsequent
repetition.


The time each processor idles at the end of a
repetition will be a finite quantity since m* is an
integer. This is essentially the time spent on
synchronizing with the input device.

    idle time = m* x Tns* - Trep                        2.6


If fewer than m* processors are available, then a
smaller repetition rate can be accommodated. Combining
equations 2.3 and 2.5, we have the following equation,
expressing the reduced throughput, R(m), in terms of the
number of processors, m,

    R(m) = m / Trep          1 ≤ m ≤ m*                 2.7


Equation 2.7 indicates that the throughput of the
multiprocessor can be varied to achieve a desired level of
performance. For example, if the input values occur in
real time at an interval of Ts, then the number of
processors required is,

    m = ⌈Trep / Ts⌉          Ts > Tns*                  2.8
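
The relations of this section can be collected into a few helper functions. The sketch below is offered only as a reading aid: the function names are hypothetical, the idle time is taken to be m* x Tns* - Trep (one full cycle of m* start intervals minus the busy time), and the numbers in the usage note are invented for illustration.

```python
import math


def max_throughput(tns_star: float) -> float:
    """R* = 1 / Tns* (equation 2.3). Requires Tns* > 0."""
    if tns_star <= 0:
        raise ValueError("with Tns* = 0 the input interval Ts sets the limit")
    return 1.0 / tns_star


def processors_for_max_throughput(trep: float, tns_star: float) -> int:
    """m* = ceil(Trep / Tns*) (equation 2.5)."""
    return math.ceil(trep / tns_star)


def idle_time(trep: float, tns_star: float) -> float:
    """Idle time per repetition when m* processors are used (assumed form)."""
    m_star = processors_for_max_throughput(trep, tns_star)
    return m_star * tns_star - trep


def reduced_throughput(m: int, trep: float) -> float:
    """R(m) = m / Trep for 1 <= m <= m* (equation 2.7)."""
    return m / trep


def processors_for_input_interval(trep: float, ts: float) -> int:
    """m = ceil(Trep / Ts) when inputs arrive every Ts > Tns*."""
    return math.ceil(trep / ts)
```

As a hypothetical example, Trep = 50 and Tns* = 8 give m* = 7 and an idle time of 6 time units per repetition. With Trep = 51 and inputs arriving every Ts = 9 time units, six processors are required.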
If, during Trep, P(i) sends n non-local values to
P(i+1) before the latter receives the first of these n
values, then the data transfer is said to be interleaved
to a degree n. If P(i) produces a maximum of k non-local
values, k ≥ n, and P(i) sends all of the k values
before P(i+1) begins to receive them, then the degree of
interleaving is k. In this case, a buffer of size k is
necessary between the processors to store the values until
P(i+1) is ready to receive them.
where we need a buffer of size n.


Figure 2.5 illustrates a hypothetical case in which
the degree of interleaving is two. Let Trs denote the
elapsed time between a receive executed by processor(i)
and the nth send to processor(i+1) following this receive
(see figure 2.5). The time processor(i) must wait at the
send is (figure 2.6),
The accumulated wait time at the end of each repetition
will be,

    Tw = ...                                            2.10
In order to circumvent this wait time, the minimum of all
such Trs in a given program, Trs*, must be such that,

    Trs* ≥ Ts                                           2.11

(see figure 2.5). For example, consider the equation 2.4,
values of the Ith input and output respectively. If the
response of the filter to an impulse input is of finite
duration then the filter is called finite impulse
Figure 2.3 displays the program that would be executed by
each processor. Send and Receive statements identify the 
transfer of non-local values between processors. In this 


example, the degree of interleaving is one. 


Processor P(0) initiates the execution by sending the
initial values of X1, X2 and X3 to its neighbor processor
P(1). P(1) reads the first input value from the input
device and produces the corresponding output value Y in
accordance with equation 2.4. In the course of the
execution, it sends a set of three new non-local values to
processor P(2). Each processor performs identical
operations on successive input values to produce successive
output values. Figure 2.7 shows an instance of a particular
implementation using six processors (only P(1), P(2) and
P(3) are shown). For the purpose of illustration, in
figure 2.7, Trep, Ts and Trs* are assumed to be 51, 9 and
10 time units respectively.
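
The hand-over of non-local values between neighbors can be mimicked in a purely functional sketch. It models only the data flow (receive X1, X2 and X3, read one input, compute Y, send three values on); timing, buffering and the ring wrap-around are ignored. The function name and the default zero initial values are assumptions made for this example.

```python
def shared_io_fir(inputs, coeffs, initial=(0.0, 0.0, 0.0)):
    """Evaluate equation 2.4 the way successive processors would.

    coeffs is (A, B, C, D). initial holds the values X1, X2 and X3 that
    P(0) sends to P(1); zeros are an assumption of this sketch.
    """
    a, b, c, d = coeffs
    link = tuple(initial)          # values in transit to the next processor
    outputs = []
    for x in inputs:               # each pass stands for the next processor
        x1, x2, x3 = link          # Receive X(I-1), X(I-2), X(I-3)
        total = a * x              # SUM = A * X
        total += b * x1            # SUM = SUM + B * X1
        total += c * x2            # SUM = SUM + C * X2
        total += d * x3            # Y   = SUM + D * X3
        outputs.append(total)      # Y goes to the shared output device
        link = (x, x1, x2)         # Send X(I), X(I-1), X(I-2)
    return outputs
```

With all four coefficients equal to 1 and inputs 1, 2, 3, 4 the outputs are 1, 3, 6, 10, which is the running sum of at most the four most recent inputs, as expected of equation 2.4.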


Figure 2.8 shows a case when Trs* is zero and Ts is
9 time units. Since the number of sends in a repetition
is three, by equation 2.10, the total wait time on send
is,
processors required then equals,





    m = ⌈69 / 9⌉
      = 8


In figure 2.7, the send instructions are moved down until
Trs* becomes greater than Ts. The wait time on send can be
seen to be zero. Figure 2.9 illustrates a case in which
Ts has been increased to such an extent (Ts=12) that the
adjustment in Trs* (Trs*=10) resulting from the shifting
of the send instructions still does not satisfy equation
2.11. Consequently, the total delay incurred is,

    Tw = (12 - 10) x (3 - 1)
       = 4 time units

The number of processors required now is,

    m = ⌈55 / 12⌉
      = 5
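
The three instances of equation 2.4 (figures 2.7, 2.8 and 2.9) can be checked with a short calculation. The sketch rests on two assumptions that are inferred from the figure captions rather than stated in the text: the wait per blocked send is Ts - Trs (never negative), and with three sends per repetition two of them incur that wait. The base repetition time of 51 units is the Trep of figure 2.7, where no wait occurs.

```python
import math


def wait_per_send(ts: float, trs: float) -> float:
    """Wait at a blocked send, Ts - Trs, and zero when Trs >= Ts."""
    return max(0, ts - trs)


def instance(base_trep: float, ts: float, trs: float, sends: int = 3):
    """Return (total wait, resulting Trep, processors required).

    The factor (sends - 1) is an assumption inferred from the worked
    numbers, not a formula quoted from the thesis.
    """
    total_wait = (sends - 1) * wait_per_send(ts, trs)
    trep = base_trep + total_wait
    processors = math.ceil(trep / ts)
    return total_wait, trep, processors


figure_2_7 = instance(51, ts=9, trs=10)   # Trs >= Ts, so no wait
figure_2_8 = instance(51, ts=9, trs=0)
figure_2_9 = instance(51, ts=12, trs=10)
```

Under these assumptions the three calls reproduce the values given with the figures: Trep of 51, 69 and 55 time units, and 6, 8 and 5 processors respectively.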
2.3. Summary 


In this chapter we presented the details of the
shared I/O organization. It is a special purpose
organization developed for implementing a few specific
algorithms. Many of the signal processing algorithms, like
the FIR and IIR filters we considered here, have been found
appropriate for implementation on the shared I/O system
[HEUF80]. It operates by overlapping successive repetitions
of a program loop to an extent determined by the
characteristics of the algorithm. We developed relations
that describe the throughput, the number of processors,
the idle time on input synchronization of the system and
certain conditions to reduce the delays incurred during
transfer synchronization. In the next chapter we will
present a comparison between shared I/O and the concept of
pipelining.
[Diagram: the Ith iteration, showing the receipt of Y(I-2)
and Y(I-1), the updates Y(I)=Y(I)+C*Y(I-2) and
Y(I)=Y(I)+B*Y(I-1), the interval Tns, and the send of Y(I).]

Figure 2.1. Ith iteration of program 2.1


[Diagram: timelines of P(i) and P(i+1), offset by Ts=Tns*.]

Figure 2.2. Scheduling adjacent processors
            with Tns* > 0 and Ts = Tns*

A, B and C are non-local values. If the interval, Ts,
is less than Tns*, then P(i+1) would have waited at
the second receive until B becomes available.


[Diagram: the Ith iteration, showing the sends and receives
of X(I-1), X(I-2) and X(I-3), the updates
Y(I)=Y(I)+B*X(I-1), Y(I)=Y(I)+C*X(I-2) and
Y(I)=Y(I)+D*X(I-3), and the instants Tsend and Trcve.]

Figure 2.3. Ith iteration of equation 2.4
Tns* = Trcve - Tsend = 0 (since time cannot be negative)


[Diagram: timelines of P(i) and P(i+1), offset by Ts>Tns*.]

Figure 2.4. Scheduling adjacent processors with
            Tns* = 0 and Ts > Tns*


[Diagram: timelines of P(i) and P(i+1) marking the send and
receive times and the intervals Ts and Trs.]

Legend:

    ts(i,j): time P(i) sends value j
    tr(i,j): time P(i) receives value j

For zero wait time at ts(1,3), we have,

    ts(1,3) ≥ tr(2,1)
    tr(2,1) = ts(2,3) - Trs

If there is no wait time then,

    ts(2,3) = ts(1,3) + Ts

Therefore,

    ts(1,3) ≥ ts(1,3) + Ts - Trs
    Trs ≥ Ts

Figure 2.5. Interleaved data transfers with the
            degree of interleaving as two


[Diagram: timelines of P(i) and P(i+1) marking the send and
receive times and the intervals Ts and Trs.]

Legend:

    ts(i,j): time P(i) sends value j
    tr(i,j): time P(i) receives value j

At ts(1,3), P(i) must wait since the buffer is full.
The wait time,

    Tw = tr(2,1) - ts(1,3)
    tr(2,1) = ts(2,3) - Trs
    ts(2,3) = ts(1,3) + Ts

Therefore,

    Tw = Ts - Trs

Figure 2.6. The case when Trs < Ts with the degree
            of interleaving as two


[Diagram: the programs executed by P(1), P(2) and P(3).
Each consists of Input X(I), Send and Rcve statements for
X(I), X(I-1), X(I-2) and X(I-3), and the computations
SUM = A * X, TEMP = B * X1, TEMP = C * X2, TEMP = D * X3,
SUM = SUM + TEMP and Y = SUM + TEMP.]

Figure 2.7. An instance of equation 2.4
            with Ts=9, Trs=10, Trep=51, m=6.


Figure 2.8. Another instance of equation 2.4
            with Ts=9, Trs=0, Trep=69, m=8.


Figure 2.9. Yet another instance of equation 2.4
            with Ts=12, Trs=10, Trep=55, m=5.


CHAPTER 3 


Comparison with pipeline structures 


Techniques to achieve increased speed of operation
include parallel and pipelined execution. The first
achieves high performance by executing several evaluations
of a function in parallel on different data. The latter
splits the function to be performed into several
subfunctions. Each of the divisions is allocated to a
separate piece of hardware to execute. Both methods are
useful in applications where repeated evaluation of a
function is required. As an example, consider the following
FORTRAN program:

    1     DO 3 I = 1, N
    2     Y(I) = A*X(I) + B*Z(I)                        3.1
    3     CONTINUE

In the pipelined execution method, the function represented
by statement 2 could be executed in two stages. In the
first stage a partial value of Y(I) is generated by
multiplying X(I) by A and passed to the second stage. Here
the operation B*Z(I) is carried out and added to the
partial result. Increased speed of operation is achieved by
overlapping successive evaluations of Y(I). In the parallel
execution method, several evaluations of statement 2
could be simultaneously carried out on several processors.
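
The two execution methods can be contrasted in a small sketch. It is illustrative only: the pipeline is stepped by a hypothetical clock tick with a single latch between the two stages, and the parallel version is written as an ordinary list expression whose elements could each be handed to a different processor. All names are assumptions of this example.

```python
def stage_one(x, a):
    """First stage: partial value A*X(I)."""
    return a * x


def stage_two(partial, z, b):
    """Second stage: add B*Z(I) to the partial result."""
    return partial + b * z


def pipelined(xs, zs, a, b):
    """Two-stage pipeline advanced one tick at a time.

    At each tick the second stage finishes the element held in the latch
    while the first stage starts the next element, so N elements take
    N + 1 ticks instead of 2N.
    """
    ys = []
    latch = None                                  # register between the stages
    for tick in range(len(xs) + 1):
        if latch is not None:
            index, partial = latch
            ys.append(stage_two(partial, zs[index], b))
        if tick < len(xs):
            latch = (tick, stage_one(xs[tick], a))
        else:
            latch = None                          # pipeline is draining
    return ys


def parallel(xs, zs, a, b):
    """Each evaluation of statement 2 is independent of the others."""
    return [a * x + b * z for x, z in zip(xs, zs)]
```

Both functions return the same values; for xs = [1, 2, 3], zs = [4, 5, 6], a = 2 and b = 3 the result is [14, 19, 24]. They differ only in how the work is spread over hardware.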


Though both methods achieve the same goals, for some
applications there are certain advantages in using one of
them in preference to the other. A general comparison of
these two methods can be found in [KOGG81]. In this
chapter, we will restrict ourselves to the comparison of
pipelining and shared I/O, which performs a computation in
parallel. In the next section we will make the comparison
between pipelining and shared I/O in terms of throughput
and efficiency. As an example, we will consider the
implementation of an FIR filter in a systolic array.
Systolic arrays work on the principle of pipelining. In the
latter part of the chapter, we will discuss some more
details that are specific to shared I/O in comparison with
pipeline structures. These details include such aspects as
incremental expandability, software and hardware
requirements, etcetera.


3.1. Performance Considerations 


Performance evaluation includes the measurement of
throughput of the system and the efficiency expressed in
terms of system utilization.

3.1.1. Throughput Considerations 


Throughput is defined as the number of outputs or
instructions processed per unit time. Consider figure 3.1,
which shows a pipeline structure with four functional




units. Each unit implements a particular subfunction. In a
non-pipelined implementation, the total execution time
would be Tnp = t1 + t2 + t3 + t4 time units, where ti, for
i = 1, ..., 4, is the time to evaluate each subfunction.
That is, for every Tnp time units there is an output from
the system. In the pipelined version, it would require only
Tp units of time to perform the same operation, where
Tp = max{ti}. The stage of the pipeline that needs max{ti}
time is referred to as the bottleneck. The maximum
performance is limited by the capability of the bottleneck.
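
As a small numerical illustration of Tnp and Tp, consider the Python sketch below. The stage times are hypothetical values chosen for the example, and the function names are not taken from the thesis.

```python
def non_pipelined_period(stage_times):
    """Tnp: time between outputs when the stages run one after another."""
    return sum(stage_times)


def pipelined_period(stage_times):
    """Tp: time between outputs once the pipeline is full."""
    return max(stage_times)


def bottleneck_stage(stage_times):
    """1-based number of the (first) stage that needs max{ti} time."""
    slowest = max(range(len(stage_times)), key=lambda i: stage_times[i])
    return slowest + 1


# Hypothetical times t1..t4 for the four functional units of figure 3.1.
t = [2, 3, 1, 2]
tnp = non_pipelined_period(t)   # 8 time units per output
tp = pipelined_period(t)        # 3 time units per output
```

With these values the pipeline produces an output every 3 time units instead of every 8, and the second unit is the bottleneck.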


For an equivalent implementation in shared I/O, the
throughput is defined by equation 2.3 as 1/Tns*. If the
output rate, 1/Ts, is such that Tls* ≥ Ts ≥ Tns*, then the
throughput is limited only by 1/Ts rather than by the
evaluation of a particular subfunction; as 1/Ts increases,
the throughput increases proportionately. However, if
Tp < Ts for some Ts (Tls* ≥ Ts ≥ Tns*), then a pipeline
structure may achieve a greater throughput than shared
I/O.

3.1.2. Efficiency Considerations 
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number of subunits of an organization that are busy ina 
given period of time. In a pipeline, there may be several
subunits, each requiring a different amount of time to
evaluate its subfunction. This complicates the measurement
of efficiency because some of these subunits may be busy
while others are idle. However, assuming an equal execution
time, Tp (as defined earlier), for all the subunits, a
simple relation may be derived. Lee [LEE 80] defines the
utilization factor, Uf, as,

\[ U_f = \frac{T_{seq}}{p \, T_{par}} \]

If L¹ is the number of functions to be evaluated, then Tseq
is the time to evaluate the L functions on a sequential
machine and Tpar is the time to evaluate the same number of
functions with p processors. For a pipeline, p is equal
to n, the number of stages. Let each function be made up
of n subfunctions. If the unit of time is the time to
evaluate one subfunction, then,

\[ T_{seq} = L \times n \ \text{time units} \]

When executed in a pipelined fashion, it requires n time
units to evaluate the first function and (L-1) time units
to evaluate the rest. Thus,

\[ T_{par} = n + (L - 1) \]

Then the utilization factor is:

\[ U_f = \frac{L \times n}{n \, (n + L - 1)} = \frac{L}{n + L - 1} \]

¹ L could be the number of loop repetitions to be executed
and the function could be the loop body. For example, in
the FORTRAN program above, Y(I) = A*X(I) + B*Z(I) is the
function to be evaluated. This function is made up of two
subfunctions, namely, A*X(I) and B*Z(I) (n is 2).
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It can be easily deduced that Uf is always less than unity 
unless L is infinity or n is one. When n is one there is
no pipelining. When L is infinity, the computation has
lasted long enough to offset the initial delay involved in
filling up the pipe.
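
The behavior of the utilization factor of a pipeline can be checked numerically. The Python sketch below assumes the definition Uf = Tseq/(p x Tpar) with p = n, Tseq = L x n and Tpar = n + (L-1), as used in this section; the function name is an assumption of the example.

```python
from fractions import Fraction


def pipeline_utilization(L, n):
    """Uf for L functions on an n-stage pipeline, one time unit per stage."""
    t_seq = L * n             # sequential machine: n subfunctions per function
    t_par = n + (L - 1)       # n units for the first result, then one per unit
    return Fraction(t_seq, n * t_par)


# A two-stage pipeline (n = 2), as for the FORTRAN example.
single = pipeline_utilization(1, 2)      # only the fill delay is seen
long_run = pipeline_utilization(100, 2)  # the fill delay is mostly amortized
```

The value approaches one as L grows and equals one for every L when n is one.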


For the shared I/O, the time to evaluate one complete
function is Trep. Then the time, Tseq, to evaluate L
functions sequentially is,

\[ T_{seq} = L \times T_{rep} \]

It takes Trep time units to obtain the first result and
there is an output from the system every Ts time units
thereafter (recall that the throughput for shared I/O is
1/Ts). Thus, with p processors,

\[ T_{par} = T_{rep} + (L - 1) \, T_s , \qquad U_f = \frac{L \, T_{rep}}{p \, \left( T_{rep} + (L - 1) \, T_s \right)} \]

Further simplification will arise if we assume that
m = (Trep/Ts), with p = m processors. We get,

\[ U_f = \frac{L}{m + L - 1} \]

We notice that this relation is similar to the one obtained
for the pipeline. A utilization factor of one is attainable
only when m is one or L is infinity. Thus, both pipeline and
shared I/O utilize the available resources to the same
degree.
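
The corresponding calculation for the shared I/O can be sketched in Python as follows. The sketch assumes Uf = Tseq/(p x Tpar) and that the number of processors equals m = Trep/Ts; both the function name and the sample values of Trep and Ts are assumptions made for the example.

```python
from fractions import Fraction


def shared_io_utilization(L, t_rep, t_s):
    """Uf for L functions on shared I/O, assuming p = m = Trep/Ts processors."""
    m = Fraction(t_rep, t_s)             # assumed number of processors
    t_seq = L * t_rep                    # one processor evaluating everything
    t_par = t_rep + (L - 1) * t_s        # first result, then one every Ts
    return Fraction(t_seq) / (m * t_par)


# Hypothetical timings: Trep = 12 and Ts = 3 time units, hence m = 4.
u = shared_io_utilization(100, 12, 3)
```

For these values the result is 100/103, which agrees with L/(m + L - 1) for m = 4.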


Thus, we have shown that shared I/O performs better
than pipeline in terms of throughput for certain
algorithms (for which Ts < Tp). Since the degree of
utilization is the same for both approaches, the overall
performance of shared I/O is better than pipeline for these
algorithms.

In the next section, we consider an example of a
pipeline implementation of the FIR algorithm and compare
the results with those obtained for a similar
implementation on shared I/O. Our aim here is to illustrate
the differences in performance between shared I/O and a
pipeline that has actually been implemented.


3.2. Systolic Arrays: an example 


Systolic arrays are highly regular structures that
function on the principles of pipelining. A systolic system
consists of a set of interconnected cells, each capable of
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performing some simple operation. Intermediate results 
from a computation flow through the cells in a pipelined
fashion [KUNG82]. Originally designed for doing matrix
computations efficiently and cost-effectively with VLSI
technology, they can easily be adapted for signal
processing applications by formulating the problems as
matrix operations. Kung and Leiserson describe a method of
performing the necessary calculations for the FIR filter
[KUNG80].


A linearly connected network is employed in which the
inputs to each cell are from the left, right and the top.
Figure 3.2 shows one cell of the systolic array and one
processor of shared I/O configured to implement the FIR
filter. Recalling equation 2.4, we notice that in shared
I/O, the coefficients of each term are stored within the
processor during power-on time, and stay until the
completion of the computation. The X values move from one
processor to another after entering from the top, and the Y
values stay, only to be output at the end of one
computation. In the systolic cell, the coefficients move in
from the top, a fresh set entering the cells for each
computation, the Y values move from right to left and the X
values from left to right. Figure 3.3 shows the FIR filter
expressed as a multiplication of the coefficient matrix, C,
and the X matrix, producing the Y matrix as the result.
Figure 3.4 shows the cell connections to perform the
desired multiplication. During the course of the
multiplication, only alternate cells are active on each
cycle. Each cell performs one multiplication and one
addition, producing,

\[ y^{(k+1)}(i) = y^{(k)}(i) + c(i,k) \, x(k) \]

where y^(k)(i) is the kth approximation to the final value
of y(i), c(i,k) is the (i,k)th element of the coefficient
matrix, C, and x(k) is the kth element of x. Thus the cycle
time is defined as the time to do one multiplication and one
addition. This represents the time required by each stage of
the pipeline to execute the multiplication and addition
operations. Figure 3.5³ shows the first few steps of the
multiplication. If there are w terms in the FIR filter (in
equation 2.4, w is four) then w cells are required to
implement the algorithm. Then the total time required to
complete the multiplication is given by,

\[ T = 2(n-1) + w = 2(n-1) + 4 \ \text{time units} \tag{3.4} \]

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After the first w time units, the Y values begin to emerge 
on the left side and every two cycles we have another Y
value. If n is the number of elements in Y, then we need
another 2(n-1) cycles to compute the remainder of Y, which
explains the above equations.

³ In this figure, y^(k)(i) is simply denoted as yi and x(k)
as xk.
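
The accumulation performed for each Y value can be written out as a short Python sketch. It follows the recurrence in which one multiplication and one addition are performed per step, with the coefficients A, B, C, D forming the rows of the banded matrix of figure 3.3. It models the arithmetic only, not the movement of data between cells, and the function name and the sample numbers are assumptions of the example.

```python
def fir_by_recurrence(coeffs, x):
    """Accumulate each y(i) one term at a time: y(i) = y(i) + c * x(k)."""
    w = len(coeffs)                  # number of terms in the filter
    n = len(x) - w + 1               # number of complete Y values
    y = [0] * n                      # Y is initialized to zero
    for i in range(n):
        for j, c in enumerate(coeffs):
            # one multiplication and one addition, i.e. one cycle of a cell
            y[i] = y[i] + c * x[i + j]
    return y


# Hypothetical coefficients A, B, C, D = 1, 2, 3, 4 and inputs x1..x5.
y_values = fir_by_recurrence([1, 2, 3, 4], [1, 2, 3, 4, 5])
```

Here y1 = A*x1 + B*x2 + C*x3 + D*x4 and y2 = A*x2 + B*x3 + C*x4 + D*x5, which matches the terms accumulated in the steps of figure 3.5.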


In comparison, for the shared I/O, we need 12 time
units to generate the first Y value (4 time units for
performing 4 multiplications and 4 additions⁴ and 8 time
units for 8 transfers⁵) and Ts time units to evaluate
successive values of Y, totally requiring,

\[ T = (n-1) \, T_s + 12 \ \text{time units} \tag{3.5} \]

Comparing the first terms of equations 3.4 and 3.5, we
observe that for the systolic array implementation, the
throughput is limited by the time required to do one
multiplication and one addition. In the shared I/O, it
varies as a function of Ts. The throughput is higher in the
shared I/O when Ts is less than 2 time units. Ignoring the
constant terms, the overall execution time is also less in
shared I/O when Ts is smaller than 2 time units.
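
Equations 3.4 and 3.5 can be compared directly for sample values. The Python sketch below simply evaluates the two expressions; the function names and the chosen values of n and Ts are assumptions of the example, and one time unit is the time for one multiplication and one addition.

```python
def systolic_time(n, w=4):
    """Equation 3.4: total time for n Y values on a w-cell systolic array."""
    return 2 * (n - 1) + w


def shared_io_time(n, t_s, first_result=12):
    """Equation 3.5: total time for n Y values on the shared I/O."""
    return (n - 1) * t_s + first_result


n = 100
systolic = systolic_time(n)          # 2*99 + 4
fast_shared = shared_io_time(n, 1)   # Ts = 1 time unit
slow_shared = shared_io_time(n, 2)   # Ts = 2 time units
```

For n = 100 the systolic array needs 202 time units, whereas the shared I/O needs 111 time units when Ts is 1 and 210 time units when Ts is 2, where only the constant terms separate the two organizations.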
3.3. Other Aspects 


In this section, we compare the pipeline and shared
I/O organizations in more general terms. We have already
discussed the performances of both the organizations in
terms of throughput and utilization. Here, the comparison
is in terms of data movement, hardware, software
requirements, incremental expandability and testing.

⁴ There are, in fact, only 3 additions, which means it
requires less than 4 time units to generate the first Y
value.

⁵ Assuming each transfer would take as long as one
multiplication and one addition, which is a rather
conservative estimate.


Movement of data:
The local values in a program stay within a processor
in the shared I/O and the non-local values move
between the processors. In a pipeline implementation,
they may either stay within a stage or move between
the stages. For example, in the systolic array
implementation of the FIR filter, X and Y values move
from stage to stage. In the shared I/O, Y values stay
within the processor and X values move between the
processors. That is, the stages in a pipeline may
either implement a local segment of the shared I/O or
a non-local segment. If Tls* is the maximum time to
evaluate one local segment, then the throughput of
the pipeline is limited by max(Tls*,Tns*). For the
shared I/O, the limit on the throughput is Tns*.
Thus, shared I/O performs better when Tns* < Tls*.


Hardware:
For each stage in the pipeline, the minimum amount of
hardware required includes the hardware necessary to
implement the subfunction of the stage and two
connections to establish communication with neighbor
stages. For the shared I/O, each processor should be
able to implement the entire function, and should
have two pairs of connections: one pair for
communication to the input and output devices and the
other for transferring non-local values. For the
pipeline, at the very least the first and last stages are
different from the rest of the pipeline, since they have
to interface to the external input and output devices.
The processors of the shared I/O, on the other hand, are
all identical including their external connections.


Software: 
The software for the shared I/O is identical for all
the processors except for the first processor, which
contains the initialization code. The first processor
initializes all the non-local variables to their
pre-determined values and begins the execution of the
function by transferring them to the second processor.
For the pipeline, the software depends upon the
subfunction implemented by each stage, and is therefore
different for all the stages.


Incremental expandability: 
Once a function has been partitioned and the
subfunctions have been allocated to the various stages, it
is difficult to expand the pipeline by adding another
stage. The expansion would necessitate modifying the
subfunctions implemented by all other stages.
However, with the shared I/O configuration, another
processor can be easily added to the structure by
reconfiguring the inter-processor communication links
between the two processors where the new unit is to
be attached.


Testing:
From equation 2.7, we note that the shared I/O can
be operated with only one processor. In this minimum
configuration, a processor is considered adjacent to
itself. The inter-processor connections of the
processor are connected to itself. Although it operates
at the lowest throughput, this feature enables the testing
and debugging of the programs to be executed by the
processors, using only one processor. At the completion
of the testing, sufficient processors can be
added to achieve the desired performance level. With
the pipeline organization, this is not possible.


Table 3.1 summarizes the above discussion. The last
entry in the table refers to the need to decompose a loop
body in the case of a pipeline, so that the partitions are
executed on separate stages. For the shared I/O, such a
division is not necessary because the entire loop body is
executed by each processor. Both the approaches have some
programming overheads in the form of detecting
dependencies between various statements within a loop.
These dependencies have to be strictly resolved in order
to produce correct results. Analyzing a program for data
dependencies is not a trivial task [HEUF80].


3.4. Summary 


We have compared the shared I/O organization to the
pipeline structures and have discussed a number of
features that make the shared I/O organization attractive
for certain applications. We also identified that the
pipeline is more effective when Tp is such that Tp < Ts,
Tls* ≥ Ts ≥ Tns*. We should mention here that we made a
basic assumption about the operations within a loop for
implementation on either shared I/O or pipeline. We
assumed that all the loop repetitions are identical. In
some cases this may not be true. For example, if the loop
body contains conditional statements, then depending on
the conditions prevailing at execution time, some
operations of the loop may not be executed. For such
cases, neither the pipeline nor shared I/O achieves
optimum performance.


                        Pipeline                 Shared I/O

1) Values               local or non-local       non-local
   transferred          or both

2) Performance          max(Tls*, Tns*)          Tns*
   limited by

3) Hardware             different for each       same for all
                        stage, with one input    processors, with four
                        and one output           connections required
                        connection for each      for each: two for
                        stage. At least the      input and two for
                        first and last stages    output
                        are different

4) Software             different for all        identical for all
                        stages                   processors[1]

5) Incremental          not possible             can be done with
   expandability                                 ease

6) Prototype using      no                       yes
   only one processor

7) Decomposition        yes                      no
   of the loop body

[1] Except for the first processor which contains the
initialization code.


Table 3.1 Comparison of shared I/O and pipeline
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Figure 3.1. 4 processors in pipeline configuration 


[Two panels: one cell of the systolic array for the FIR
computation (coefficients, X and Y connections) and one
processor of the shared I/O for the FIR computation (X and
Y connections).]

Figure 3.2. Systolic cell and shared I/O processor for
the FIR filter


   | A B C D         |   | x1 |     | y1 |
   |   A B C D     0 |   | x2 |     | y2 |
   |     A B C D     |   | x3 |     | y3 |
   |       A B C .   |   | .  |  =  | .  |
   | 0         . .   |   | .  |     | .  |
   |             .   |   | xn |     | yn |

            C              X     =    Y


Figure 3.3. FIR filter expressed as a matrix
multiplication
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Figure 3.4. Systolic cell arrangement for FIR filter 
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Step 0. During the first three steps, x1 is moved to P(3) 
and x2 to 1 as part of the initialization. Y is
initialized to zero.


1 2 3 4


Step 1. y1 enters 4. A enters 4. x1 and x2 move right to 4
and 2.


Step 2. y1 moves left to 3. B enters 3. x2 moves right to 3.
x3 enters 1.


Step 3. y1 moves left to 2. C enters 2. A enters 4. y2
enters 4. x2 moves right to 4 and x3 to 2.


y1=y1+C*x3
y2=y2+A*x2


Step 4. y1 moves left to 1. D enters 1. B enters 3. y2
moves left to 3. x3 moves right to 3. x4 enters 1.


y1=y1+D*x4
y2=y2+B*x3


Step 5. y1 is output. C enters 2. A enters 4. y2 moves left
to 2. x4 moves right to 2 and x3 to 4. y3 enters 4.


y2=y2+C*x4
y3=y3+A*x3


Step 6. y2 moves left to 1. D enters 1. B enters 3. y3
moves left to 3. x4 moves right to 3. x5 enters 1.


y2=y2+D*x5
y3=y3+B*x4


Figure 3.5. First seven steps of the FIR algorithm
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CHAPTER 4 


Inter-Processor Communication Mechanisms 


An important consideration in the design of shared
I/O is that data transfers between the processors be
completed in as short a time span as possible. Since the
sends and receives occur on a time scale comparable to
instruction execution time, if a large percentage of the
processing of a loop repetition is spent in supervising
the data transfers, the performance gained by the parallel
execution of the loop repetitions may be lost. Also, Tns*
includes the time expended in the data transfers (see
figure 2.2). Thus, by equation 2.3¹, any reduction in the
time required for the transfers would directly result in
the enhancement of the throughput.


The overheads involved in data transfers are in the
form of observing tight synchronization between the sender
and receiver. In the shared I/O, processor(i) sends
values to processor(i+1) and receives from processor(i-1)
(see figure 2.2). To allow proper operation, it must be
ensured that P(i) does not send a second value to P(i+1)
before the latter has read the first². Similarly, P(i)
must be prevented from receiving from P(i-1) when there is
nothing to receive. This could be managed in software by
incorporating a busy loop in the program. Once within the
loop, P(i) would repeatedly access the status of its
neighbor and act accordingly: either wait or go. This
introduces an unnecessary processing overhead as a result
of having to execute the busy loop even when the
circumstances are such that P(i) does not have to wait. If
implemented in hardware, such a loss of precious time can
be eliminated. That is, we could dispense with the delay
of sending if the previous value has already been received
and the delay of receiving if the next value has already
been sent. In this chapter, we describe two hardware
designs to handle the inter-processor communication as
required.

¹ Equation 2.3: R=1/Tns*.

² It is assumed that there exists a buffer of size one
between the processors, P(i) and P(i+1).


In the first, a hardware unit maintains a buffer of
size one to hold the data sent by P(i). The buffer is
considered full when P(i) sends a value to it and empty when
P(i+1) reads from it. The buffer full and buffer empty
conditions are termed exceptional conditions. P(i) is
forced into a wait state³ if it tries to send a data when
the buffer is full. It is released from the wait state when
the buffer becomes empty. Similarly, P(i+1) is put in a wait
state when it tries to read from an empty buffer and
released from it when the buffer turns full. Figure 4.1
illustrates the above protocol. Note that this requires
no knowledge on the part of the software about the
availability of the buffer for a data transfer.

³ In a wait state, the processor waits and does nothing.
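
The protocol of the first design can be expressed as a small behavioral model. The Python sketch below is an illustration only: the class and method names are assumptions, and a return value of False stands for the processor being held in a wait state by the hardware rather than for any action taken by the program.

```python
class SingleSlotBuffer:
    """Buffer of size one between P(i) and P(i+1)."""

    def __init__(self):
        self.full = False
        self.value = None

    def send(self, value):
        """P(i) sends. False means P(i) is forced into a wait state."""
        if self.full:                 # exceptional condition: buffer full
            return False
        self.value = value
        self.full = True
        return True

    def receive(self):
        """P(i+1) reads. (False, None) means P(i+1) must wait."""
        if not self.full:             # exceptional condition: buffer empty
            return (False, None)
        self.full = False
        return (True, self.value)
```

A second send before the first value has been read is refused, and a read from the empty buffer is refused, which are the two conditions under which the hardware inserts wait states.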


The second design is a more general version of the
first where there is a buffer of size more than one. If
the size is n, then it permits at the most n sends from
P(i) to take place before any receives from P(i+1). Once
the buffer is full, it allows n receives from P(i+1) to
take place before any send from P(i). In the following
section we describe the processor selection. The primary
requirement for a processor is that it be capable of
entering a wait state as explained above.
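
The more general buffer of size n behaves as a first-in, first-out queue. The following Python sketch, under the same assumptions as a behavioral model (names are illustrative, and False again stands for a hardware wait state), shows that n sends may precede any receive and that the values are delivered in the order sent.

```python
from collections import deque


class FifoBuffer:
    """Buffer of size n between P(i) and P(i+1)."""

    def __init__(self, size):
        self.size = size
        self.items = deque()

    def send(self, value):
        """P(i) sends. False means the buffer is full and P(i) waits."""
        if len(self.items) == self.size:
            return False
        self.items.append(value)
        return True

    def receive(self):
        """P(i+1) reads. (False, None) means the buffer is empty."""
        if not self.items:
            return (False, None)
        return (True, self.items.popleft())


fifo = FifoBuffer(3)
sends = [fifo.send(v) for v in (1, 2, 3, 4)]     # the fourth send must wait
receives = [fifo.receive() for _ in range(4)]    # the fourth read must wait
```

With a size of three, three sends are accepted before P(i) has to wait, and three reads are then served before P(i+1) has to wait.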
4.1. Selection of a Processor 


The processor has to be selected in such a way that
it permits analysis of different approaches without unduly
increasing the cost factor. It must also be flexible
enough to allow examination of different algorithms to see
if they are feasible for implementation on the shared I/O.
For the latter purpose, it is advantageous if the processor
is programmable, that is, it can be used under software
control. A microprocessor easily satisfies these two
constraints. The use of a microprocessor not only allows
software reconfigurability but also, at the current market
prices, gives a smaller cost to benefit ratio.
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Depending upon the techniques used to interface to 
the peripheral devices⁴, microprocessors can be classified
into three categories: synchronous, asynchronous and
semisynchronous⁵.


A synchronous microprocessor interfaces with devices
of matching speed of operation. That is, the data
transfers between the CPU and the peripheral take place
within a fixed period of time. At the end of the period,
the CPU assumes the successful completion of the transfer
and proceeds to execute the next instruction. MC6800 from
Motorola is an example. Such a processor is not suitable
for our purpose because on exceptional conditions, there
is no method of indicating to the CPU that the data
transfer could not be completed and hence the CPU must
wait.


Asynchronous microprocessors monitor the addressed
peripheral and automatically enter a wait state if it does
not respond within a specific time period. For instance,
in MC68000, the peripheral conveys the completion of the
data transfer to the CPU through a Data Transfer
Acknowledge (DTACK) line. As long as this line is not
activated by the peripheral, the CPU waits by inserting
extra clock cycles in the current machine cycle. This
conforms to our requirement that the processor must be
capable of entering a wait state when necessary. However,
the one disadvantage is that sufficient logic must be
provided to generate the DTACK signal on all data transfers
including those that do not spring an exceptional
condition. This logic would be in the form of a timing
device that would activate and deactivate the DTACK signal
at proper instances in time.

⁴ These include memory as well as I/O devices.

⁵ This classification is based on bus communication
techniques discussed by Thurber, et al [THUR72].


Semisynchronous microprocessors operate just as
synchronous microprocessors as long as the peripherals are
faster than or as fast as the CPU. Nevertheless, they
include a provision by which a slow device, if necessary,
could force the CPU into a wait state. For example, the
WAIT line on Zilog's Z80 CPU is used for this purpose.
When the WAIT line is active, the CPU enters and remains
in a wait state. Note that this is the reverse of the
operation of the DTACK line on the MC68000 which, when not
active at a particular instant in time, forces the CPU into
a wait state. For our design, it means that the circuitry
must include enough logic to generate the WAIT signal only
on exceptional conditions. That is, the timing device
necessary for an MC68000 type microprocessor would be
absent. Without going into more details, let us state that
we opted for the Z80 type microprocessor. Later in this
chapter, we look into the use of asynchronous
microprocessors.

Apart from the Z80, Intel's INTEL8080A and 8085, MCS6500
from Signetics, CDP1802 by RCA and IM6100 of Intersil
from the 8-bit microprocessor group offer a similar WAIT
line facility [OSBO78]. MCS6500 does not allow the use of
the WAIT control line during write cycles, which is very
limiting for our applications. From the 16-bit group,
INTEL8086, Z8000, General Instruments' CP1600 and
TI9900 from Texas Instruments have a similar feature
[OSBO81]. On CP1600, the maximum length of the wait state
is limited to about 40 microseconds by the CPU to keep its
internal registers refreshed, which indirectly affects the
program size. Apart from these two, any other from the
above set could be selected as a processor. We used the Z80
due to its popularity as an 8-bit microprocessor and its
availability in our laboratory.


For the data transfers, we selected the parallel
communication mode in which the data is transferred in
parallel as a complete entity. This mode of operation is the
most suitable for the shared I/O organization because the
high bandwidth of the parallel bus allows the transfers to
take place in the shortest time [GABL80]. We used the
Parallel Input/Output (PIO) unit of the Z80 microcomputer
family to manage the parallel by word communication.


The following section describes the operation of the
PIO. The rest of the chapter deals with the development of
the two designs that handle the inter-processor
communication during data transfers. The first, called the
HOLD controller, makes use of the register internal to the
PIO as a buffer of size one to hold the data to be
transferred. In the second design, a first-in, first-out
(FIFO) buffer of size 32 is used to contain the data to be
transferred.
4.2. Operation of the PIO 


Figure 4.2 shows a box diagram of the Z80 PIO unit as
connected to P(i). A and C are the I/O ports⁶. In general,
both can function as either input or output ports.
However, in our design, port A is configured as an input
port and port C as an output port. Each port has a register
to hold the data to be transferred. Associated with
each port, there are two control lines termed ready
(RDY) and strobe (STB). These are also referred to as the
handshake signal lines. RDY is an output signal indicating
whether the port is available for a transfer of data. It
is an active-high signal. STB is an input signal generated
by P(i-1) or P(i+1). For port C, it is used to
gate the contents of the port register to the inter-port
data bus. For port A, it is used to gate the inter-port
data bus to the port register. STB is an active-low
signal.

⁶ There are two PIO chips on the PIO unit with a total
of four ports: A, B, C and D. C and D are the same as A
and B except that they are on different chips.



The next section lists the special terms used in the 
remainder of this chapter. The subsequent section 
describes the behavior of the PIO during send and receive 


operations. 
4.2.1. Terminology 


Special terms introduced in this chapter are
explained below. Some of the terms may be found only in
the figures.


P(i)          Processor i.

CPU           Z80 CPU.

RD            Read signal from the CPU. When low, it
              indicates that the peripheral should place
              a data on the system data bus.

WR            Write signal from the CPU. When low, it
              indicates that the system data bus contains
              a valid data to be output.

IORQ          I/O request signal from the CPU. When low,
              it indicates that the system address bus
              (lines A0-A7) contains a valid I/O device
              address. The RD and WR signals are used in
              conjunction with IORQ.

WAIT          An input to the CPU which, when low, forces
              the CPU to insert additional clock periods
              in the current machine cycle. WAIT(i) is
              the wait signal for P(i).

A(i), C(i)    Ports A and C of the PIO connected to P(i).

ARDY(i),      The RDY and STB signals of ports A(i) and
ASTB(i),      C(i) respectively.
CRDY(i),
CSTB(i)

CE1           Chip enable 1. Selects the PIO chip
              containing port A.

CE2           Chip enable 2. Selects the PIO chip
              containing port C.

A/B           Selects port A or B within a PIO chip.
              Low for port A.

C/D           Control or data. Low for data.

T1, T2,       Clock periods in a machine cycle used in
T3, T4        the timing diagrams. T1 appearing after T3
              signifies the start of the next machine
              cycle.

Tw            Clock period during which the CPU is
              waiting. For the I/O operations, one Tw is
              automatically inserted by the CPU.

STD bus       System bus which includes the address
              lines, data lines and the control lines of
              the CPU.

PIO bus       Bus that connects ports A and C. This
              includes the handshake lines.

HOLD bus      Bus that carries the signals generated by
              the HOLD controller.

FIFO bus      Bus that carries the signals generated by
              the FIFO circuit.

              The signal is turned high.

              The signal is turned low.

              Logical AND operator.

              Logical OR operator.

              Logical negation.

ms            Milliseconds.

us            Microseconds.

ns            Nanoseconds.


4.2.2. Data transfer operations using PIO 


The send and receive operations are implemented using
the OUT and IN instructions of the CPU. Figure 4.3 shows
the timing relations between various signals during the
execution phase of these I/O instructions. During T1,
the address of the peripheral becomes stable. For an OUT
instruction, the data to be output also becomes stable
at about this time. The IORQ and WR or RD are activated
in T2. For an IN instruction, the data is read at the
falling edge during T3. Just before T1 of the succeeding
instruction, the IORQ and WR or RD signals are deactivated
to signify the completion of the I/O operation. One Tw is
automatically inserted by the CPU during these
instructions.
4.2.2.1. Send: OUT (nn), A


The OUT instruction, when executed by P(i), outputs
the contents of the register A to C(i). nn is the address
of C(i). Figure 4.4 shows the timing relations between
the various PIO signals during the execution of this
instruction. CRDY(i) is set by the PIO at the falling edge
of the T1 of the succeeding instruction. This denotes that
the inter-port data bus contains a valid data for
transfer. CRDY(i) remains high until a rising edge is
detected on CSTB(i). At this instant it is reset by the
PIO to indicate that there is no valid data in the port
output register. The rising edge on the CSTB(i) is
generated by the receiver, P(i+1), after it has read the
port. If the CRDY(i) is high when the instruction is
executed, then it is reset by the PIO at the falling edge
during Tw. It is maintained in that state until the
falling edge during T1 of the next instruction. This is to
ensure that the CRDY(i) is low while the port data is
changing. It also ensures that a positive edge is
generated on CRDY(i) whenever an output instruction is
executed.
4.2.2.2. Receive: IN A, (nn) 


The effect of this instruction is to cause the
contents of the input port register to be loaded into
register A. Figure 4.5 displays the timing diagram for the
PIO signals during this operation. The sending device,
P(i-1), examines the ARDY(i) line before it sends any
valid data to P(i). If found high, it resets the
ASTB(i) line. This transfers the data present on the data
bus to the input register of A(i). At the end of the
transfer, the ASTB(i) is set by P(i-1). The PIO resets the
ARDY(i), disabling the port for further data transfers.
When an IN instruction is executed by P(i), at the
falling edge during T1 of the succeeding instruction, the




ARDY(i) is set by the PIO, enabling the port. If the
ARDY(i) line is found high when an IN instruction is in
progress, it is kept low by the PIO until the falling edge
during T1 of the next instruction.
4.3. The HOLD controller 
4.3.1. General operation 


From the description of the handshake lines of the
PIO we can deduce two things: after the receiver has read
the previous data, its ARDY is high, and when the sender
has sent data to the receiver, the ARDY of the receiver
is low. If the HOLD controller monitored these two lines
at the time of transfers, it could easily regulate the
message traffic. ARDY(i+1) will be low when P(i+1) is not
ready to receive data. If at this time P(i) tries to
send another value to P(i+1), the HOLD controller will
place a low on the WAIT line of P(i). This would prevent
P(i) from completing the OUT instruction and keep it in a
wait state until ARDY(i+1) is set. In the same manner,
ARDY(i) would be high if P(i-1) has not sent any data to
P(i). If P(i) attempts to read its input port at this
time, it will be forced into a wait state by the HOLD
controller.


The HOLD controller should be able to recognize
the attempts by either processor to send to or receive
from the other in order to generate the necessary
signals. This is made possible by including some
decoding logic that will produce a signal when the
corresponding port is addressed for data transfer.
An OUT' signal is generated if port C is accessed during
an OUT instruction. Similarly, an IN' signal is produced
if port A is addressed during an IN instruction.


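\[ \mathrm{OUT}' = \mathrm{WR} \lor \mathrm{IORQ} \lor \mathrm{CE2} \lor (\mathrm{C/D}) \lor (\mathrm{A/B}) \]

\[ \mathrm{IN}' = \mathrm{RD} \lor \mathrm{IORQ} \lor \mathrm{CE1} \lor (\mathrm{C/D}) \lor (\mathrm{A/B}) \qquad (4.1) \]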


We observe that OUT' or IN' will be low when an OUT or IN
instruction is executed addressing the particular port
involved. Then the WAIT signal for P(i) on send is
generated as follows (recall that the WAIT is an
active-low signal):

\[ \mathrm{WAIT}(i) = \mathrm{OUT}' \lor \mathrm{ARDY}(i+1) \]

In the same manner, the WAIT signal for P(i) on receive is
generated as follows:

\[ \mathrm{WAIT}(i) = \mathrm{IN}' \lor \lnot\mathrm{ARDY}(i) \]

Combining these two equations, we get

\[ \mathrm{WAIT}(i) = (\mathrm{OUT}' \lor \mathrm{ARDY}(i+1)) \land (\mathrm{IN}' \lor \lnot\mathrm{ARDY}(i)) \qquad (4.2) \]


Due to some complications encountered during
implementation, equation 4.2 cannot be used as it is (this
will be described later). Two signals, SEND(i) and RCVE(i),
are introduced which at any time define the state of the
data transfers on P(i). A low on SEND(i) means that
P(i+1) has not yet read the previous value and thus P(i)
cannot initiate another send. Similarly, a low on RCVE(i)
refers to the fact that there is nothing to receive and
hence P(i) must wait. Thus equation 4.2 stands modified
as

\[ \mathrm{WAIT}(i) = (\mathrm{OUT}' \lor \mathrm{SEND}(i)) \land (\mathrm{IN}' \lor \mathrm{RCVE}(i)) \qquad (4.3) \]
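
The behaviour described by equation 4.3 can be checked with a small truth-table sketch. The Python fragment below is illustrative only: the function name wait_line and the use of the integers 0 and 1 for signal levels are assumptions made for this example and are not part of the circuit.

```python
def wait_line(out_n: int, send: int, in_n: int, rcve: int) -> int:
    """Level of the active-low WAIT(i) line according to equation 4.3.

    out_n, in_n : levels of OUT' and IN' (0 while the instruction
                  addresses the port, 1 otherwise)
    send, rcve  : levels of SEND(i) and RCVE(i)
    """
    return (out_n | send) & (in_n | rcve)


# P(i) starts a send while SEND(i) is low: WAIT(i) goes low (wait state).
blocked_send = wait_line(out_n=0, send=0, in_n=1, rcve=0)
# P(i) starts a send while SEND(i) is high: WAIT(i) stays high.
free_send = wait_line(out_n=0, send=1, in_n=1, rcve=0)
# P(i) starts a receive while RCVE(i) is low: WAIT(i) goes low.
blocked_recv = wait_line(out_n=1, send=1, in_n=0, rcve=0)
# No I/O instruction in progress: WAIT(i) is never asserted.
idle = [wait_line(1, s, 1, r) for s in (0, 1) for r in (0, 1)]
```

When neither OUT' nor IN' is low the result is always 1, which reflects the fact that the HOLD controller can only hold the processor during an I/O instruction addressed to its own ports.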


After P(i) reads port A(i), it resets RCVE(i) to
indicate that the port is empty. It also sets the
SEND(i-1) of P(i-1) so that the latter can execute another
send instruction. In the same fashion, after P(i) has sent
a value to P(i+1), it resets SEND(i). It also sets the
RCVE(i+1) in P(i+1), indicating that A(i+1) is full, so that
the latter can do a receive. Figure 4.6 shows the above
protocol in an algorithmic fashion.
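
The protocol can also be sketched as a toy model in software. The following Python fragment is not a transcription of Figure 4.6; it is a minimal sketch that follows the prose above, and the names HoldController, try_send and try_receive are hypothetical. The initial levels (SEND high, RCVE low) are assumed to be those produced by a power-on reset.

```python
class HoldController:
    """Toy model of the SEND/RCVE bookkeeping for one processor P(i)."""

    def __init__(self):
        self.send = 1       # SEND(i): high means A(i+1) can take another value
        self.rcve = 0       # RCVE(i): high means A(i) holds an unread value
        self.port_a = None  # contents of the input port A(i)


def try_send(sender, receiver, value):
    """P(i) executes OUT. Returns False if P(i) would be held in a wait state."""
    if sender.send == 0:
        return False
    receiver.port_a = value  # ASTB(i+1) strobes the value into A(i+1)
    sender.send = 0          # reset SEND(i)
    receiver.rcve = 1        # set RCVE(i+1)
    return True


def try_receive(receiver, sender):
    """P(i) executes IN. Returns (ok, value); ok is False on a wait state."""
    if receiver.rcve == 0:
        return False, None
    value = receiver.port_a
    receiver.rcve = 0        # reset RCVE(i): the port is empty again
    sender.send = 1          # set SEND(i-1): the sender may send again
    return True, value


p1, p2 = HoldController(), HoldController()
log = [
    try_receive(p2, p1),     # nothing has been sent yet
    try_send(p1, p2, 0x41),  # first send goes through
    try_send(p1, p2, 0x42),  # second send has to wait
    try_receive(p2, p1),     # receiver reads 0x41
    try_send(p1, p2, 0x42),  # now the send can complete
]
```

In the hardware the blocked processor is simply held in a wait state until the condition clears; the model returns False instead so that the sequence of events can be inspected.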


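In the implementation, an equivalent form of equation
4.3 is employed so that the available components could be
used for fabrication:

\[ \mathrm{WAIT}(i) = \lnot[(\mathrm{OUT} \land \lnot\mathrm{SEND}(i)) \lor (\mathrm{IN} \land \lnot\mathrm{RCVE}(i))] \qquad (4.4) \]

where OUT and IN denote the active-high complements of
OUT' and IN'.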
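
That a rearranged form is equivalent to equation 4.3 can be confirmed by exhaustive enumeration. The sketch below compares equation 4.3 with one De Morgan rearrangement of it; whether this is exactly the gate arrangement used in the implementation is an assumption, and the function names are hypothetical.

```python
from itertools import product


def wait_eq_4_3(out_n, send, in_n, rcve):
    # WAIT(i) = (OUT' OR SEND(i)) AND (IN' OR RCVE(i))
    return (out_n | send) & (in_n | rcve)


def wait_inverted_form(out_n, send, in_n, rcve):
    # NOT[(OUT AND NOT SEND(i)) OR (IN AND NOT RCVE(i))]
    out, in_ = 1 - out_n, 1 - in_n  # active-high OUT and IN
    return 1 - ((out & (1 - send)) | (in_ & (1 - rcve)))


# Collect every input combination on which the two forms disagree.
mismatches = [
    bits for bits in product((0, 1), repeat=4)
    if wait_eq_4_3(*bits) != wait_inverted_form(*bits)
]
```

An empty list of mismatches over all sixteen input combinations shows that the two expressions describe the same WAIT line.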


Apart from generating the WAIT signal, the HOLD
controller is also responsible for transferring the data
sent by the sender to the receiver. This is necessary
because the HOLD controller bases its decision to stop P(i)
from completing its receive instruction on the signal level
on RCVE(i). RCVE(i) is set by P(i-1) at the end of the
send operation as we explained earlier. To accomplish the
transfer of data from sender to receiver, the ASTB of the
receiver is generated by the HOLD controller of the
sender. This loads the data from the inter-port data bus
onto port A of the receiver. That is, the HOLD controller
of P(i) generates ASTB(i+1) to transfer the data sent by
P(i) to P(i+1).


Thus, the HOLD controller of P(i) generates six
signals: OUT', IN', SEND(i), RCVE(i), ASTB(i+1) and
WAIT(i). Figure 4.7a displays the timing relation between
these signals during an OUT instruction. Here it is
assumed that there are no exceptional conditions. The
CRDY(i) is monitored by the HOLD controller of P(i). The
rising edge on CRDY(i) signifies the completion of the
output instruction and the presence of valid data in the
inter-port data bus. This is used to generate the
ASTB(i+1), which transfers the data to A(i+1). The rising
edge on the ASTB(i+1) sets RCVE(i+1), indicating that
P(i+1) has data to receive. SEND(i) is reset by the rising
edge on OUT'. Note that WAIT(i) remains inactive
throughout the data transfer.


Figure 4.7b shows similar timing relations for the
case of an IN instruction. Here also the absence of
exceptional conditions is assumed. The rising edge on IN'
indicates the completion of the input instruction. This
is used to reset RCVE(i) to disable P(i) from executing
another receive. It is also used to set SEND(i-1) so that
P(i-1) is enabled to do another send. The WAIT(i) is not
activated during the transfer.


Figures 4.8(a,b) illustrate the cases where there is
an exceptional condition. In figure 4.8a, the WAIT(i) is
reset the moment OUT' is generated because SEND(i) is low.
When SEND(i) is set by P(i+1) (figure 4.7b), WAIT(i) is
set by the HOLD controller, releasing P(i) from the wait
state. The send operation is completed and the data is
strobed into A(i+1). Similarly, as shown in figure 4.8b,
the WAIT(i) is properly generated when P(i) initiates a
receive when RCVE(i) is low. P(i) exits the wait state
when RCVE(i) is set by P(i-1) after it has sent a value
(figure 4.7a). Figure 4.9 shows the final circuitry as it
was built and tested.


From the above discussions, we observe that the only
"external" signal the HOLD controller uses is the CRDY(i)
produced by the PIO of P(i). The HOLD controller uses this
signal to generate ASTB(i+1). However, a close examination
of figure 4.7a will reveal that the generation of
ASTB(i+1) could have been triggered by the rising edge on
OUT'. This means that the HOLD controller could actually
have been built independent of the PIO unit and attached
to the CPU as a separate peripheral device. The ASTB
signals, though part of the PIO, would still be necessary to
serve as latching signals in a design that does not use
the PIO.


The next section discusses some of the problems
encountered during the implementation of the HOLD
controller. Beginning with equation 4.2, it shows how
equation 4.3 is derived. Though the problems were not very
difficult to solve, their solutions did lead to a design
that could function independent of the PIO, as we explained
above.
4.3.2. Problems and solutions 


Let us assume for the time being that we are trying
to implement equation 4.2 as it is. Suppose that P(i) has
just completed a receive operation. Referring to figure
4.5, we see that ARDY(i) is set by the PIO at the end of
the execution of the corresponding IN instruction. Let us
also assume that P(i) is initiating another receive
operation. IN' will be low and, since ARDY(i) is high, by
equation 4.2, WAIT(i) will be reset by the HOLD controller.
This forces P(i) into a wait state. P(i) should continue
to be in the wait state until ARDY(i) is reset by P(i-1)
after it has sent the next data. However, P(i) does not
remain in the wait state as expected. As shown in figure
4.10, P(i) is prematurely released from the wait state
because the PIO resets the ARDY(i) line (ARDY(i) becomes
low) during the execution of an IN instruction, as
explained earlier (see figure 4.5 and section 4.2.2).
This results in P(i) reading the previous data as many
times as IN is executed by P(i) before the next valid data
arrives from P(i-1). The sender does not suffer from a
similar setback because it makes use of the ARDY line of
the receiver rather than the CRDY of its own PIO.


To circumvent this situation, the RCVE(i) signal is
introduced in place of ARDY(i). RCVE(i) is made to reflect
the state of the input port at any time: a high indicating
the presence of valid data and a low representing an empty
port. RCVE(i) is set by P(i-1) after it strobes in the
data using ASTB(i). RCVE(i) is reset by P(i) after it
completes the receive operation. Thus RCVE(i) removes the
circuit's dependency on ARDY(i) and prevents P(i) from
exiting the wait state inappropriately. However, this
arrangement introduces another problem: deadlock.


Suppose that P(i) begins the execution of an IN
instruction when RCVE(i) is low. The HOLD controller will
promptly force P(i) to enter a wait state. P(i) will be
maintained in the wait state until RCVE(i) is set by P(i-1).
But the ARDY(i) line is still reset by the PIO, and kept
low until the end of execution of the IN instruction. Now,
if P(i-1) tries to do a send, it will be blocked from
completing it because ARDY(i) is low (figure 4.11). The
result is deadlock.
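
The deadlock can be reproduced with a few lines of logic. The sketch below assumes the intermediate arrangement just described, in which the receiver already uses RCVE(i) but the sender still watches the ARDY line of the receiver; the function name and the simplified model of how the PIO holds ARDY(i) low are assumptions made for illustration.

```python
def wait_intermediate(out_n, ardy_next, in_n, rcve):
    """Active-low WAIT with RCVE in place but the sender still using ARDY."""
    return (out_n | ardy_next) & (in_n | rcve)


# Receiver P(i): IN in progress (IN' = 0), nothing received yet (RCVE(i) = 0).
wait_receiver = wait_intermediate(out_n=1, ardy_next=1, in_n=0, rcve=0)

# Simplified model: the PIO keeps ARDY(i) low for as long as the IN
# instruction of P(i) has not completed, i.e. while P(i) is waiting.
ardy_i = 0 if wait_receiver == 0 else 1

# Sender P(i-1): OUT in progress (OUT' = 0), sees ARDY(i) low.
wait_sender = wait_intermediate(out_n=0, ardy_next=ardy_i, in_n=1, rcve=0)

# Both WAIT lines are low: each processor waits for the other.
deadlocked = wait_receiver == 0 and wait_sender == 0
```

RCVE(i) can only be set once the send of P(i-1) completes, and that send cannot complete while ARDY(i) is low, so neither WAIT line is ever released.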


The deadlock can be prevented if the
sender's direct dependency on the receiver's ARDY line is
removed. The SEND(i) signal is introduced to replace
ARDY(i+1) in equation 4.2. A high on SEND(i) means that
A(i+1) is available for another value. A low indicates the
opposite. SEND(i) is set by P(i+1) after it has received
the previous data. It is reset by P(i) after a successful
completion of a send operation.


The RCVE and SEND signals could actually have been
integrated into one signal for the HOLD controller. One
signal would have sufficed in this case because the send
and receive operations are not permitted to occur at the
same time. However, in the next design, as we will see,
they can indeed occur at the same time and hence a single
signal would have introduced a "critical section". Just to
make the generalization easier, we introduced both RCVE
and SEND in the HOLD controller itself.


A power-on reset circuit is included that will
initialize the SEND and RCVE signals to high and low
respectively at power-on time; this allows the uninterrupted
execution of the first send operation. It also ensures
that the first receive instruction is blocked if no data has
been sent by the sender yet.


There is one problem the current design does
not address: the Z-80 available in the laboratory uses
dynamic RAM which requires a refresh signal every 2 ms.
The manufacturer recommends a time gap of 1 ms between
successive refresh cycles. This means the CPU cannot be
held in a wait state for longer than 1 ms, since the refresh
cycles are generated by the CPU itself during the course
of instruction execution. However, it should be
mentioned that a period of 1 ms represents, on the average,
the execution time of about 100 Z-80 instructions. If the
programs consist of more than 100 instructions, then it is
recommended that the dynamic RAM be replaced by a static
RAM to avoid loss of data in the presence of extended wait
states. The loss of data can also be prevented by
including refresh circuitry.
4.3.3. Technical details 


Figure 4.9 is the circuit diagram of the HOLD
controller. The address decoding is done by a 7485
comparator and 74LS01 NAND gates to produce CE1 or CE2, A/B
and C/D. A 74LS05 hex inverter inverts these signals and the
IORQ and WR signals. These are input to 7421 AND gates to
produce the OUT and IN signals. A 74121 monoshot generates
ASTB(i+1). Note that the ASTB is generated as an active-high
signal. The inverters on the PIO unit convert this signal
into an active-low signal.


The positive pulse output from the 74121 is 160 ns
long, which is 10 ns more than required for strobing the
data bus onto the input register of port A(i+1). The SEND
and RCVE signals are derived from two flipflops (74LS73A).
The falling edge of ASTB(i) is used to set the RCVE flipflop.
The falling edge on IN triggers a monoshot that generates
a clear pulse for the RCVE flipflop. Similarly, the
falling edge on IN from the receiver, P(i+1), is used to set
the SEND flipflop. The falling edge on OUT triggers a
monoshot that clears the SEND flipflop (see figures
4.7(a,b) and 4.8(a,b)).


A 74123 is used to generate the clear pulses for the
SEND and RCVE flipflops. These clear pulses are 100 ns
long. This means that these two signals will be low for a
period of 100 ns from the moment they are cleared. The
next processor should not be allowed to set them during
this period because, when the clear pulse is active, the
flipflops ignore all the inputs. A close examination of
the timing diagrams of 4.7(a,b) and 4.8(a,b) will reveal
that such a situation will never arise.


For a more complete implementation, the 74LS76A should
have been used in place of the 74LS73A because the former
has a preset facility whereas the latter does not. The
preset and clear features are required to implement the
power-on reset circuitry. The reset signal from the CPU can
be used to set the SEND flipflop and reset the RCVE flipflop
at power-on time and at every manual reset of the entire
system. Since the 74LS76A and other flipflops with preset and
clear features were not available at the time of
construction of the HOLD controller, the power-on reset
facility could not be implemented.


4.3.4. Testing of the HOLD controller 


The configuration in which the HOLD controller was
tested consists of two Z-80 microprocessors with their PIO
units interconnected: C(1) is connected to A(2), and C(2)
is connected to A(1). Two copies of the HOLD controller
were built; one was attached to each processor. Two
methods of testing were used. In the first, called
static testing, the generation of appropriate signals was
monitored for operator-assisted communication. The program
in each CPU waits in a loop until issued a command to
send or receive from the terminal keyboard. Depending
upon the prevailing conditions, the processor is prevented
from completing the instruction when the buffer is full
or empty. In the second, referred to as dynamic testing,
the programs are left to run freely without any operator
intervention. One of the processors is deliberately made
slower than the other by including extra instructions.
The WAIT line of the faster processor is examined
for the presence of the wait signal during the execution of
the I/O instruction, and the length of the wait state is
observed to be equal to the execution time of the extra
instructions on the other processor. Figure 4.12 shows the
program used for the dynamic testing.


The sender, after setting up the PIO, begins to send
continuously at an interval of approximately 10 us. The
receiver receives a character approximately every 164 us.
The large time delay in the receiver is obtained by
including extra instructions in the receiver's program.
Fifteen instructions, EX (SP),IX, each requiring 10.35 us,
were included to introduce a delay of 155.25 us. Thus the
delay between successive receive instructions is about
164 us (including the time required to do a branch). This
makes the HOLD controller associated with the sender
force the sender into a wait state for approximately
154 us before every send except the first one. In the
same fashion, the sender was made slower than the receiver
and similar results were obtained. Figure 4.12 shows the
program only for the case in which the receiver is slower
than the sender.
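
The figures quoted for the dynamic test are related by simple arithmetic. The following sketch only restates the numbers from the text; the variable names are hypothetical, and the receive interval is the approximate value quoted above rather than a measured one.

```python
EXTRA_INSTRUCTIONS = 15      # delay instructions added to the receiver
TIME_PER_EXTRA_US = 10.35    # execution time of each delay instruction
SEND_INTERVAL_US = 10.0      # approximate interval between sends
RECEIVE_INTERVAL_US = 164.0  # approximate interval between receives

# Delay contributed by the extra instructions alone.
added_delay_us = EXTRA_INSTRUCTIONS * TIME_PER_EXTRA_US

# Time the faster sender has to spend waiting before each send
# (every send except the first one).
expected_wait_us = RECEIVE_INTERVAL_US - SEND_INTERVAL_US
```

The product reproduces the 155.25 us delay, and the difference between the two intervals gives the wait of about 154 us observed on the sender.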


The next section presents a design where a buffer of
size more than one is utilized. The Am2812, a first-in,
first-out (FIFO) buffer, is used for this purpose.
4.4. The FIFO circuit


The HOLD controller described previously represents
data transfers between adjacent processors using a buffer
of size one where the buffer was part of the PIO unit. In
this section, a mechanism using external buffers of size
more than one is presented. The PIO is still used to
interface with the CPU. That is, the CPU still outputs to
port C and inputs from port A. However, the FIFO buffer
and the associated circuitry handle the routing of the
data from the sender to the receiver. The FIFO buffer
used, the Am2812, can contain up to 32 bytes of data.


The next section briefly describes the Am2812,
followed by a discussion of the FIFO circuit. The last
section describes the differences between the FIFO circuit
and a method of using the private memory associated with
P(i) to form an internal buffer. In the discussions to
follow, a slot is used to represent one internal unit of
the FIFO buffer that can hold one byte of data. For
instance, the Am2812 contains 32 slots.
4.4.1. The Am2812 


The Am2812 can hold up to 32 bytes of data. Once the
first slot is filled up, the data ripples through the
buffer towards the output side and occupies the vacant
slot next to a full slot. If the buffer is empty, then
the data falls through the buffer to the output pins. It
takes 10 us for the data to ripple through the buffer from
the input side to the output side when the buffer is
empty. This is referred to as the ripple through time.


There are two signals associated with either end of
the buffer. IR (Input Ready) is an active-low signal which
indicates if the first slot is empty. When the buffer is
full, IR will be high. The PL (Parallel Load) signal loads
the data present on the input pins onto the first slot. OR
(Output Ready) indicates that the output pins contain
valid data. When the buffer is empty, OR remains low. The PD
(Parallel Dump) signal dumps the contents of the last slot
onto the output pins. This action is followed by the
shifting of the data inside the buffer towards the output
side by one slot. PL, OR and PD are active-high signals.
During the transfer of data either from the input pins to
the first slot or from the last slot to the output pins,
the corresponding ready (IR, OR) signals remain low.
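
The externally visible behaviour of such a buffer can be summarized in a short behavioural model. The sketch below is not a model of the Am2812 itself: it ignores signal polarities and all timing (ripple through time, pulse widths), and the class and method names are hypothetical.

```python
from collections import deque


class FifoModel:
    """Behavioural stand-in for a 32-slot first-in, first-out buffer."""

    def __init__(self, slots=32):
        self.slots = slots
        self.data = deque()

    @property
    def input_ready(self):
        # True while the first slot is free, i.e. the buffer is not full.
        return len(self.data) < self.slots

    @property
    def output_ready(self):
        # True while the output side holds a valid byte.
        return len(self.data) > 0

    def parallel_load(self, byte):
        """PL: load one byte at the input side."""
        if not self.input_ready:
            raise BufferError("first slot is occupied")
        self.data.append(byte)

    def parallel_dump(self):
        """PD: remove the byte at the output side; the rest shift by one slot."""
        if not self.output_ready:
            raise BufferError("buffer is empty")
        return self.data.popleft()


fifo = FifoModel()
for b in range(32):
    fifo.parallel_load(b)
full_flag = fifo.input_ready      # False: all 32 slots are occupied
first_out = fifo.parallel_dump()  # 0: bytes leave in arrival order
```

After one dump the buffer has a free slot again, so input_ready becomes true; this is the condition the surrounding circuit has to track in order to release a waiting sender.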


Out of these four signals, the FIFO circuit must be
able to generate PL for loading the data sent by the sender
into the buffer and PD for reading the last slot to
provide data for the receiver. Figure 4.13 shows the timing
relations for the Am2812. PL is a positive pulse of width
at least 100 ns. It is to be generated after the data on
the input pins have stabilized. Similarly, PD is a
positive pulse of width 100 ns. The PD should be present after
the input port of the receiver becomes available for
another data. The FIFO generates the PD after the receiver
has completed its input instruction. From the start of PD,
it takes about 900 ns for the data to move
from the last slot to the output pins. Similarly, after
PL has been generated, it requires about 550 ns for the
first slot to become empty after the loaded data has
shifted towards the output side.


4.4.2. Operation of the FIFO circuit 


Figure 4.14 shows a block diagram of the FIFO
circuit. The operation of the FIFO circuit is very similar
to that of the HOLD controller. The SEND and RCVE signals
determine at any time if the initiated operation, send or
receive, can be completed. A low on SEND(i) means that the
first slot of the FIFO buffer is not available for loading
(buffer of P(i) is full). A low on RCVE(i) indicates that
there is no data present in the input port register
(buffer of P(i-1) is empty). Both these conditions signify
that P(i) must wait. When P(i) begins a send operation,
OUT' [7] is generated. If the SEND(i) is found low at
this time, then WAIT(i) is produced. Similarly, during a
receive operation, IN' is generated. WAIT(i) is produced
if RCVE(i) is low. Thus equation 4.3 holds. The difference
between the operation of the two circuits is the way the
SEND and RCVE signals are set and reset.


Figure 4.15 shows the timing relations for the
operations discussed below. The falling edge on OUT' is used
to trigger a monoshot that produces PL(i). This also
resets the SEND(i). The IR(i) is deactivated by the Am2812
during the period in which the data from C(i) is loaded
into the first slot. IR(i) is re-activated by the Am2812
after this data has moved away towards the output side,
emptying the first slot. If the loading of the first slot
fills the buffer, then IR(i) will not be activated until
the first slot becomes empty. The rising edge of IR(i) is
used to set SEND(i), enabling P(i) to initiate another
send. In the same fashion, when P(i) initiates and
completes a receive, a negative IN' pulse is produced by the
FIFO circuit of P(i). The rising edge on IN' signifies the
end of the input operation, after which the input port
(A(i)) becomes available for another data. This resets
RCVE(i) to disable P(i) from initiating further receive
operations. This IN' signal is also monitored by the FIFO
circuit of P(i-1). The PD(i-1) pulse is generated by this
FIFO circuit, which dumps the contents of the last slot of
its buffer onto the output pins. The OR(i-1) is inactive
during this operation. Once the dumping has been completed,
the rising edge on OR(i-1) is used to trigger the
generation of ASTB(i), which loads the data into A(i). The
rising edge on ASTB(i), which signifies the completion of
the transfer, is used to set RCVE(i). The rising edge on
OR(i-1) would not be present if the corresponding buffer
was empty.

[7] OUT' and IN' are described by equation 4.1.


As we mentioned earlier, it requires about 550 ns for
the data to be loaded into the FIFO buffer. This
constitutes an upper bound on the frequency of the output
instructions that can be executed by the CPU. However,
the Z-80 in operation requires about 4.4 us before it can
initiate another send, which is very large compared to
550 ns. On the receiver side, the FIFO takes about 900 ns
to move a new data to its output pins. It requires another
150 ns to strobe this data into the input port register of
the receiver. Consequently, there must be a delay of at
least 1.05 us between successive input instructions.
Again, the large time delay does not concern us because
the CPU in use is slow. Figure 4.16 shows the final FIFO
circuit. Due to time constraints this could not be built
and tested.
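
The timing margins mentioned above follow directly from the quoted figures. The sketch below restates that arithmetic; the variable names are hypothetical, and the use of 4.4 us as the minimum spacing of input instructions as well as output instructions is an assumption for this example.

```python
LOAD_NS = 550              # first slot busy after PL
DUMP_NS = 900              # last slot to output pins after PD
STROBE_NS = 150            # strobe into the receiver's input port
CPU_IO_INTERVAL_NS = 4400  # time before the CPU can issue the next I/O

# Minimum spacing between successive input instructions.
min_receive_gap_ns = DUMP_NS + STROBE_NS

# Slack left by the CPU on either side of the buffer.
send_margin_ns = CPU_IO_INTERVAL_NS - LOAD_NS
receive_margin_ns = CPU_IO_INTERVAL_NS - min_receive_gap_ns
```

Both margins are several microseconds, which is why the delays of the buffer do not limit a CPU of this speed.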


As in the case of the HOLD controller, the FIFO circuit
could have been built without the use of the PIO unit.
Referring to figure 4.15, we see that the FIFO circuit does
not make use of any of the handshake lines associated with
the PIO except for the ASTB.
4.5. A method of using internal buffer 


In this section we will describe a method of housing
the buffer in the private memories of the processors
instead of keeping it external as in the case of the above
two designs. Though this offers the flexibility of altering
the buffer size under program control, we will see
that the FIFO circuit fares better than this method in
terms of speed of operation.


In the internal buffer method, a part of the memory,
M(i), of P(i) is set aside as a circular buffer to store
the data sent by P(i). The associated circuitry, called
the DMA circuit, would access M(i) to store the data and
later retrieve it when P(i+1) requires it. The storage
and retrieval operations are carried out by doing a direct
memory access. This has to be done in three parts: first,
issue a bus request (BUSRQ) to the CPU and wait until the
CPU relinquishes the bus; next, access the memory to
either store or retrieve a data; and lastly, update the
pointers so that buffer full or empty conditions can be
detected. Two pointers have to be maintained, one to keep
track of the next available storage location (HPTR) and
the second to point to the next available data for
retrieval (TPTR). When HPTR is found equal to TPTR on a
store operation initiated by P(i), it signifies a buffer
full condition. P(i) has to be halted at this instant
from completing the send operation. Similarly, on a
receive command from P(i+1), if TPTR is null, then the
buffer is empty and hence P(i+1) has to be halted. The
halting cannot be done using the WAIT line because, when
the WAIT line is active, the CPU maintains its control
over the bus. Thus the DMA circuit would be unable to
enter the processor's memory to store or retrieve a data,
which will be necessary to remove the exceptional
condition. The only way the DMA can operate now is to issue a
BUSRQ and wait until the processor releases the bus lines.
Once they are released, the DMA circuit must assume
control over them and retain it until the exceptions are
removed. This will be done at the end of execution of the
current instruction. However, the completion of the
instruction lets the sender lose its data on a buffer full
condition and allows the receiver to read invalid data on
a buffer empty condition.
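
The pointer bookkeeping described above can be sketched in software. The fragment below is a minimal illustration of the full and empty tests only (HPTR equal to TPTR on a store, TPTR null on a retrieve); it does not model the bus request sequence, and the class and method names are hypothetical.

```python
class CircularBuffer:
    """Circular buffer with a head pointer (HPTR) and a tail pointer (TPTR)."""

    def __init__(self, size):
        self.mem = [None] * size
        self.hptr = 0     # next free location for storage
        self.tptr = None  # next value to retrieve; None (null) means empty

    def store(self, value):
        """Return False on a buffer full condition (the sender must be halted)."""
        if self.hptr == self.tptr:
            return False
        self.mem[self.hptr] = value
        if self.tptr is None:
            self.tptr = self.hptr  # first value in an empty buffer
        self.hptr = (self.hptr + 1) % len(self.mem)
        return True

    def retrieve(self):
        """Return (ok, value); ok is False on a buffer empty condition."""
        if self.tptr is None:
            return False, None
        value = self.mem[self.tptr]
        self.tptr = (self.tptr + 1) % len(self.mem)
        if self.tptr == self.hptr:
            self.tptr = None       # last value taken: the buffer is empty again
        return True, value


buf = CircularBuffer(2)
stores = [buf.store("a"), buf.store("b"), buf.store("c")]
reads = [buf.retrieve(), buf.retrieve(), buf.retrieve()]
```

With a buffer of two locations the third store is refused (buffer full) and the third retrieve is refused (buffer empty), which are the two exceptional conditions discussed in the text.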


For the sender, the loss of data can be prevented by
adding an extra latch outside the memory to hold the data
until the new location at HPTR becomes empty. For the
receiver, the reading of the incorrect data can be avoided
by forcing it to execute another IN instruction by means
of interrupts or by having two IN instructions in tandem
throughout the program. The first instruction will give
advance information to the circuit about the intentions of
the CPU and the second will read valid data. When the
first option is used, an IN instruction in the interrupt
service routine would accomplish the task. The choice
of using interrupts is not very desirable because it
increases the program execution time [8]. The
alternative of having doubled IN instructions in the
program is also equally undesirable because it increases the
program execution time by 4.4 us [9] for every input value
to be read.

[8] This includes the time to recognize the interrupt,
time to branch to the ISR and the time to execute an input
instruction.


Also during normal operations, when there are no
exceptional conditions, for every output instruction from
P(i), it requires 350 ns to store the data in the buffer
(the time for a memory write) plus at least another 400 ns
(one T cycle), which is the time required by the CPU to
regain control of the busses. Similarly, for every input
instruction executed by P(i+1), a total of 750 ns is
required to furnish A(i+1). To this the time period of
a machine cycle (maximum 5 T cycles or 2 us) has to be
added since it is the time required by the CPU to release
the busses. If there are N non-local values to be sent
and received, then this represents a total stretch of
N x 3.5 us in the program execution time.
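
The figure of 3.5 us per value can be retraced from the individual delays. The sketch below shows one way to arrive at it, assuming that the bus release time is counted once per value; the variable and function names are hypothetical.

```python
MEMORY_ACCESS_NS = 350  # one memory write or read by the DMA circuit
BUS_REGAIN_NS = 400     # one T cycle for the CPU to regain the busses
BUS_RELEASE_NS = 2000   # one machine cycle (at most 5 T cycles)

per_output_ns = MEMORY_ACCESS_NS + BUS_REGAIN_NS  # store on a send
per_input_ns = MEMORY_ACCESS_NS + BUS_REGAIN_NS   # retrieve on a receive
per_value_ns = per_output_ns + per_input_ns + BUS_RELEASE_NS


def total_stretch_us(n_values):
    """Extra execution time, in microseconds, for n non-local values."""
    return n_values * per_value_ns / 1000.0
```

For example, total_stretch_us(1000) evaluates to 3500.0, i.e. a program exchanging 1000 non-local values would be stretched by about 3.5 ms under these assumptions.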


In the DMA circuit, on a buffer full condition, the
sender has to wait for an additional 1.1 us after the input
port of the receiver becomes empty [10]. On a buffer empty
condition, the additional time is only 150 ns, which is the
time required for strobing the data into the input port of
the receiver, assuming that a short circuit path is
available from the sender to the receiver to route the
data directly to the receiver when it is waiting.

[9] 4.4 us is the time to execute an IN instruction.

[10] 350 ns to retrieve a byte of data for the receiver,
350 ns to store the data that caused the buffer full
condition in the buffer and 400 ns for the CPU to resume its
normal operation.


In contrast, for the FIFO circuit, no processor time
is wasted in updating the buffer. The extra waiting time
on a buffer full condition is the width of the OUT pulse
during normal operation (approximately 1 us), and 150 ns on
a buffer empty condition assuming that a similar short
circuit path is available. If such a path is not present then
the additional waiting time is equal to the ripple through
time (10 us). Table 4.1 lists the above discussion in a
tabular form.
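The extra waiting times quoted for the two circuits can be collected into a small lookup. This is a sketch only: the names are hypothetical, and the case of a DMA circuit without the short circuit path is not covered by the text, so the sketch does not model it.

```python
# Extra waiting time (ns) on exceptional conditions, as quoted in the text.
EXTRA_WAIT_NS = {
    ("DMA", "full"): 350 + 350 + 400,  # retrieve + store + CPU resume = 1100
    ("DMA", "empty"): 150,             # strobe into the receiver's input port
    ("FIFO", "full"): 1000,            # approximately the width of the OUT pulse
    ("FIFO", "empty"): 150,            # assumes the short circuit path exists
}
FIFO_RIPPLE_NS = 10_000                # ripple-through time of the FIFO


def extra_wait_ns(scheme, condition, short_path=True):
    """Additional wait for scheme ("DMA"/"FIFO") and condition ("full"/"empty").

    short_path only matters for the FIFO buffer empty case; the text gives
    no figure for a DMA circuit without the short circuit path.
    """
    if scheme == "FIFO" and condition == "empty" and not short_path:
        return FIFO_RIPPLE_NS
    return EXTRA_WAIT_NS[(scheme, condition)]


if __name__ == "__main__":
    for key in sorted(EXTRA_WAIT_NS):
        print(key, extra_wait_ns(*key))
```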
4.6. Asynchronous processors in shared I/O


The HOLD controller and the FIFO circuit discussed
above interface two semisynchronous microprocessors. In
this section, we examine how these designs should be
modified so that they can be used to interface two
asynchronous microprocessors. This is to give the user of
shared I/O an option to choose between these two types of
microprocessors. The MC68000 microprocessor is considered for
this purpose. The approach here is just to mention what
additional functional units are required for the MC68000 in
comparison to the circuitry we built for the Z80. No attempt
is made to give exact details of the circuit.


The MC68000 is a memory mapped microprocessor. That is,
its input and output instructions are regular memory read
and write instructions. The send and receive operations of
the shared I/O can be implemented using these read and
write instructions. The address specified in the
instruction would be the address of the HOLD controller or
FIFO circuit¹¹. There will be one I/O controller for each
processor. The I/O controller of P(i) will send data to the
I/O controller of P(i+1) and receive from that of P(i-1).


The communication between the CPU and the I/O
controller takes place in a fully interlocked manner. For
reading a value from the I/O controller, the CPU uses two
control lines. The data strobe line (DS)¹², when activated,
indicates to the I/O controller that the CPU is ready to
begin the transfer. The I/O controller loads the data on
to the data bus and activates the Data Transfer
ACKnowledge (DTACK) line, indicating that the data is
available on the data bus. The CPU accesses the data and
deactivates DS. This conveys to the I/O controller that the
data transfer has been properly completed and it should
release the DTACK line.


¹¹ In the remainder of this chapter, the HOLD controller
and FIFO circuit are referred to as the I/O controller.
The MC68000 is referred to as the CPU.

¹² There are two data strobe lines. UDS and LDS specify
the unit of data being transferred: byte (8 bits) or word
(16 bits). Here we assume that only one-byte transfers are
present. That is, only one of these two lines would be
used. DS denotes this line.
Similarly, for an output from P(i), the activation of
DS signifies the start of the operation. The edge on DS
can be used to latch the data present on the data bus. At
the end of loading the data from the data bus, the I/O
controller will activate the DTACK line. When DS is deactivated
by the CPU, the edge on DS can be used to deactivate
DTACK. At the completion of a write operation, the I/O
controller routes the data to the I/O controller of the
receiver. This is achieved by loading the data on the data
bus between the I/O controllers and generating a latch
signal (similar to ASTB of the I/O controller of the Z80). The
I/O controller uses the RD/WR line of the CPU to identify
whether the CPU is writing data to it or trying to read
from it.


For the read operation of the CPU, if DTACK is not
present within about 250 ns¹³ from the moment of
activation of DS, then the CPU waits for two clock cycles
(250 ns). If DTACK is still not present, another two wait
cycles are inserted in the current machine cycle and so
on. Thus, when the buffer is empty, the I/O controller can
force the CPU to wait by not activating DTACK until new
data becomes available from the sender. Similarly, on a
write operation, the CPU can be made to wait by keeping
DTACK inactive if the buffer is full.
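The wait mechanism described above can be modelled in a few lines. The sketch follows only the simplified description given here (a first check about 250 ns after DS, then further checks every two clock cycles) and is not a model of the complete MC68000 bus timing; the function and constant names are assumptions.

```python
CLOCK_NS = 125                # period of the assumed 8 MHz clock
FIRST_CHECK_NS = 250          # DTACK is first looked for about 250 ns after DS
WAIT_STEP_NS = 2 * CLOCK_NS   # each further wait lasts two clock cycles


def wait_cycles(dtack_delay_ns):
    """Clock cycles of waiting inserted before the CPU sees DTACK.

    dtack_delay_ns is the time from the activation of DS until the
    I/O controller activates DTACK.
    """
    cycles = 0
    check_at = FIRST_CHECK_NS
    while dtack_delay_ns > check_at:   # DTACK not yet present at this check
        cycles += 2                    # insert two more wait cycles
        check_at += WAIT_STEP_NS
    return cycles


if __name__ == "__main__":
    # A buffer that stays empty for 1 us after DS holds the CPU for 6 cycles.
    print(wait_cycles(1000))
```

Holding DTACK inactive for longer simply adds wait cycles in pairs, which is how the I/O controller stalls the CPU on a buffer empty or buffer full condition.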


¹³ Assuming an 8 MHz clock with a period of 125 ns.
At this point, let us examine the similarities and
differences between the I/O controller we designed for the Z80
and the I/O controller for the MC68000.


(1) For both microprocessors, the corresponding I/O
controllers have some decoding logic to detect an I/O
operation. The decoding logic is larger for the MC68000,
because its address bus is 23 bits wide compared to
16 bits for the Z80.


(2) For the Z80, the PIO managed the data transfers to
and from the STD bus. For the MC68000, this has to be
done by the I/O controller¹⁴. The I/O controller has
to monitor the DS line and use the transitions on
this line in conjunction with the output from the
decoding logic to load to or load from the system
data bus.


(3) For the Z80, the WAIT signal is generated by the I/O
controller. This is a combination of Z80-CPU signals and
the SEND and RCVE signals. For the MC68000, DTACK is generated
by the I/O controller. DTACK has to be generated for
all data transfers. The generation of DTACK is
accomplished by resetting a DTACK flipflop when DS is
activated by the CPU. When DS is deactivated, the
flipflop can be set. The resetting of DTACK has to be
delayed if the SEND or RCVE signals are low during write
or read operations. It is not known how complex this
delaying mechanism would be.

¹⁴ This would not be necessary if the equivalent of the
PIO, the Peripheral Interface Adapter (PIA), is used. Here
we assume that such a device is unavailable to us.


(4) For the Z80, the I/O controller generates the ASTB signal to
route the data from the sender to the receiver. For the
MC68000, we will need a similar signal. This signal
would be used by the I/O controller at the receiver
to latch the data sent by the I/O controller at the
sender.


We conclude this brief discussion by noting that the
added logic required for the I/O controller of the MC68000 is

1) logic to load from and load to the system data bus when
DS is active. The direction of loading is determined
by the signal on the RD/WR line.

2) a circuit to control the DTACK line.

3) decoding circuitry to decode from 23 address bits.
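One way to picture the DTACK control circuit listed above is as a single combinational condition. The sketch below is a simplified reading of this section, not the actual circuit: signal polarities are abstracted to booleans, and the function name and argument names are assumptions.

```python
def dtack_active(selected, ds_active, is_write, send, rcve):
    """Return True when the I/O controller may activate DTACK.

    selected  -- the address decoder recognises the I/O controller
    ds_active -- the CPU has activated the data strobe (DS)
    is_write  -- direction taken from the RD/WR line
    send      -- SEND is high: the receiver's input port can accept a byte
    rcve      -- RCVE is high: a byte is waiting to be read
    """
    if not (selected and ds_active):
        return False                    # no transfer addressed to this controller
    ready = send if is_write else rcve  # delay DTACK while SEND or RCVE is low
    return bool(ready)


if __name__ == "__main__":
    # A write while the buffer is full (SEND low) keeps the CPU waiting.
    print(dtack_active(True, True, True, send=0, rcve=0))
```

The delaying mechanism whose complexity is left open in item (3) corresponds to the `ready` term in this sketch.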
4.7. Summary 


In this chapter we have discussed the design of two
circuits, the HOLD controller and the FIFO circuit, to
reduce the processing overhead involved in the software
synchronization during data transfers between adjacent
processors. As a result, Tns* is reduced, increasing the
system throughput. The Z80 microprocessor was used because
it had the WAIT line feature. Both circuits utilized this
facility to temporarily halt the processor on a buffer empty
or buffer full condition. By monitoring the system control
lines and the handshake lines of the PIO, the exceptional
conditions are easily sensed and corrective measures
taken. Though the general purpose PIO was used to
interface to the CPU, it is not difficult to extend these
designs to function as separate I/O units interacting
directly with the CPU. We also discussed, though very
briefly, the nature of the circuit required to
interface an MC68000 type microprocessor.
DMA versus FIFO     DMA                     FIFO

On normal
operation:

  OUT:              750 ns to update the    No processor time is
                    circular buffer         required to update
                                            the FIFO memory.
                                            Requires 550 ns
                                            between successive
                                            updates.

  IN:               2.75 us to load         No processor time is
                    another byte of data    required to load the
                    at the input port       same register from
                    register from the       the head of FIFO
                    sender's memory         buffer. Requires
                                            1.05 us between
                                            successive outputs
                                            from FIFO buffer.

On abnormal
conditions:

  Buffer full:      Additional waiting      Additional waiting
                    time is 1.1 us          time is
                                            approximately 1 us

  Buffer empty:     Additional waiting      Additional waiting
                    time is 150 ns          time is 150 ns

Expansion           The size of the         A change in the
                    buffer can be           wiring is required to
                    increased or            insert or delete an
                    decreased under         Am2812 unit to
                    software control        increase or decrease
                                            the buffer size


Table 4.1. Comparison of DMA and FIFO techniques
[Timing diagram: activity of P(i) and P(i+1) against time,
with the buffer full (BF) and buffer empty (BE) instants
marked. The sequence of events is:]


P(i) initiates send 1 

P(i) completes send 1 

buffer becomes full (BF) 

P(i) initiates send 2 

must wait since buffer is full (BF) 
P(i+1) initiates receive 1 

P(i+1) completes receive 1 

buffer becomes empty (BE) 

P(i) is released from wait state 
P(i) completes send 2 

buffer becomes full (BF) 

P(i+1) initiates receive 2

P(i+1) completes receive 2 

buffer becomes empty (BE) 

P(i+1) initiates receive 3 

must wait since buffer is empty (BE) 
P(i) initiates send 3 

P(i) completes send 3 (BF) 

P(i+1) is released from wait state
P(i+1) completes receive 3

buffer becomes empty (BE) 


and so on 


Figure 4.1. Communication protocol between
P(i) and P(i+1)
[Block diagram: the Z80-PIO of P(i). The CPU side carries the
data bus D0-D7 and the control lines. On the side of P(i-1),
an inter-port data bus with the handshake lines ASTB and
ARDY; on the side of P(i+1), an inter-port data bus with the
handshake lines CSTB and CRDY.]


Figure 4.2. Z80-PIO as connected to P(i)
[Timing diagram: one machine cycle of the clock with states
T1, T2, Tw and T3, showing IORQ and RD, for the input cycle
and for the output cycle.]

RD - Read signal        WR - Write signal
IORQ - I/O ReQuest
Tw - is one wait state inserted by the Z80 CPU


Figure 4.3. Input/Output cycles of the Z80 CPU
without extra wait states

adapted from [MOST79]




[Timing diagram: Clock (states T2, Tw, T3), CRDY(i) and
CSTB(i), shown for the normal case and for the case when
CRDY(i) is already high, in which CSTB(i) stays high.]

OUT = IORQ' WR' CE' (C/D)' where CE and C/D are
decoded from the address lines.


Figure 4.4. Timing relations on OUT instruction
for the PIO

adapted from [MOST79]
[Timing diagram: Clock (states T2, Tw, T3), IN', ARDY(i),
the port input pins being read and ASTB(i), including the
case when ARDY(i) is already high, in which ASTB(i) stays
low.]

IN = IORQ' RD' CE1' (C/D)' where CE1 and C/D are
decoded from the address lines.


Figure 4.5. Timing relations on IN instruction
for the PIO

adapted from [MOST79]
Sender [P(i)]:

send: OUT (nn),A
BEGIN send

    load the contents of register A
    into the output port register. nn
    is the address of output port C(i).

    WHILE (SEND(i) EQ 0)
        WAIT(i) := 0
    END WHILE
    WAIT(i) := 1

    complete the transfer of data to
    the input port register of P(i+1).

    RCVE(i+1) := 1
    SEND(i) := 0

END send


Receiver [P(i)]:

receive: IN A,(nn)
BEGIN receive

    WHILE (RCVE(i) EQ 0)
        WAIT(i) := 0
    END WHILE
    WAIT(i) := 1

    complete the reading of the input
    port register. nn is the address of
    the input port A(i).

    RCVE(i) := 0
    SEND(i-1) := 1

END receive


Figure 4.6. HOLD protocol
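The HOLD protocol can be exercised in software with a one-byte link between two adjacent processors. The sketch below is only a behavioural model under the assumption that SEND is 1 when the input port of the receiver is free and RCVE is 1 when a byte is waiting; the class and method names are hypothetical, and a refused operation stands for the processor being held in the wait state.

```python
class HoldLink:
    """One-byte link between a sender P(i) and a receiver P(i+1)."""

    def __init__(self):
        self.send = 1     # SEND(i): 1 means the receiver's input port is free
        self.rcve = 0     # RCVE(i+1): 1 means a byte is waiting to be read
        self.port = None  # contents of the receiver's input port register

    def try_send(self, value):
        """Return True if the OUT completes, False if P(i) must wait."""
        if self.send == 0:
            return False              # buffer full: WAIT(i) is held low
        self.port = value             # transfer the data to the input port
        self.rcve, self.send = 1, 0
        return True

    def try_receive(self):
        """Return the byte read, or None if P(i+1) must wait."""
        if self.rcve == 0:
            return None               # buffer empty: WAIT(i+1) is held low
        value, self.port = self.port, None
        self.rcve, self.send = 0, 1
        return value


if __name__ == "__main__":
    link = HoldLink()
    print(link.try_send(0x21))   # first send completes
    print(link.try_send(0x22))   # second send must wait, buffer is full
    print(link.try_receive())    # the receive empties the buffer
    print(link.try_send(0x22))   # the waiting send can now complete
```

The sequence in the usage example follows the first few events of Figure 4.1.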
[Timing diagram: Clock, OUT', SEND(i), WAIT(i) (held high)
and CRDY(i).]

Figure 4.7a. Timing relations on OUT instruction
for the HOLD controller with no buffer full
condition.

Figure 4.7b. Timing relations on IN instruction
for the HOLD controller with no buffer
empty condition.

Figure 4.8a. Timing relations on OUT instruction
for the HOLD controller with buffer full
condition.

Figure 4.8b. Timing relations on IN instruction
for the HOLD controller with buffer empty
condition.

Figure 4.10. Premature release of P(i) from wait
state during receive

Figure 4.11. The problem of deadlock

Ignoring the WAIT(i-1), the dotted lines indicate
the series of actions necessary to release P(i)
from wait state.


Dynamic testing

; This is the sender. It outputs one character every
; 10 us when within a loop.

START:  LD   A,0FH
        OUT  (0FDH),A   ;C is output port
        LD   B,46H      ;70 characters to be output per loop
        LD   C,0FCH
        CALL 0E522H     ;wait for the go signal
        LD   D,21H      ;the first character is !
LOOP:   OUT  (C),D      ;(delay=3.2 us)
        INC  D          ;change the character (delay=1.6 us)
        DJNZ LOOP-$     ;output 70 characters (delay=5.2 us)
        LD   D,21H      ;reset the registers
        LD   B,46H
        JR   LOOP-$     ;keep going
        HALT
        END  START

; This is the receiver. It inputs one character every
; 164 us when within a loop.

START:  LD   A,4FH
        OUT  (0F9H),A   ;A is the input port
        LD   B,46H      ;70 characters to be input per loop
        LD   C,0F8H
        CALL 0E522H     ;wait for the go signal
        LD   D,21H      ;dummy instruction so that both loops
LOOP:   IN   D,(C)      ;start at the same time (delay=3.2 us)
        EX   (SP),IX    ;dummy instruction introduces a delay
        EX   (SP),IX    ;of 10.35 microseconds
                        ;repeated 15 times (delay = 155.25 us)
        DJNZ LOOP-$     ;delay=5.2 us (total delay=163.65 us)
        LD   D,21H
        LD   B,46H
        JR   LOOP-$
        HALT
        END  START

The time the sender spends waiting between
successive OUTs = 154 us.


Figure 4.12. Dynamic testing program for HOLD controller
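The waiting time quoted for the dynamic test follows from the loop delays annotated in the listing. The short check below only redoes that arithmetic; the names are assumptions of the sketch and the delays are taken as printed in the comments of Figure 4.12.

```python
# Loop delays in microseconds, as annotated in the listing of Figure 4.12.
SENDER_LOOP_US = 3.2 + 1.6 + 5.2             # OUT, INC and DJNZ
RECEIVER_LOOP_US = 3.2 + 15 * 10.35 + 5.2    # IN, 15 x EX (SP),IX and DJNZ


def sender_wait_us():
    """Time the sender is held waiting between successive OUTs."""
    return RECEIVER_LOOP_US - SENDER_LOOP_US


if __name__ == "__main__":
    # The receiver is the slower side, so the sender idles for the difference.
    print(round(sender_wait_us(), 2))
```

The difference, rounded to the nearest microsecond, is the 154 us given in the figure.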
[Timing diagram: PL, IR', Data in, PD, OR and Data out.]

PL - Parallel load
IR - Input ready for loading:
     first slot is empty
PD - Parallel dump
OR - Output ready: output
     lines contain valid
     data


Figure 4.13. Timing diagram for Am2812

adapted from [AMDI80]
[Block diagram. Blocks and signals shown: a 32 deep FIFO
buffer built from Am2812 units (PL, IR', OR', PD); a decoder
on the system address bus (A0-A7); an OUT, IN generator
(IORQ, RD, WR); WAIT(i); and an ASTB(i+1), PD generator
(OR, IN(i+1)).]


Figure 4.14. Block diagram for FIFO circuit
[Timing diagram: output cycle and input cycle of P(i),
with RCVE(i).]


Figure 4.15. Timing diagram for FIFO circuit.
CHAPTER 5


Conclusions 


5.1. Summary 


In this thesis, we have presented a description of the
Shared I/O organization. It consists of multiple
processors with adjacent processors connected to each other.
Successive repetitions of a program loop are assigned to
adjacent processors such that more than one loop
repetition may execute in parallel. The maximum degree of
overlapping is controlled by a certain characteristic of the
program loop, namely, the maximum time to execute a
non-local segment, Tns*. The throughput is the same as the
rate at which the input can be streamed into the system,
1/Ts, as long as Ts satisfies two constraints:


(Sb Gua, ee GapGe 


UNE es Gite 


The first constraint is present when Tns* is greater than
zero and the second when Tns* is equal to zero.


We compared shared I/O with pipeline and pointed out
that while the performance of the former is only limited
by the input rate, for the latter, the bottleneck in the
structure forms the corresponding upper bound. We showed
that both shared I/O and pipeline have similar utilization
of the available resources. The utilization is maximum
when there is only one processor in the shared I/O and
when there is only one stage in the pipeline. We listed
additional features of shared I/O, such as expandability and
identical processor units, which make it more attractive
for some applications.


We identified that the problem of communications
overhead in Shared I/O is serious enough to warrant the
design of special controllers to manage the message
traffic between adjacent processors. The HOLD controller
and the FIFO circuitry make use of the WAIT
characteristic of the Z80 microprocessor to accomplish the task. By
monitoring certain control lines, they are able to
temporarily halt the processors on exceptional conditions,
thus relieving the processors from testing for such
situations themselves. The design is reasonably
straightforward and the final outcome is an assembly of simple
circuit modules such as NAND gates and flipflops, except for
the FIFO circuit, which uses an LSI Am2812 first-in,
first-out buffer. With little modification, they can function as
separate peripheral attachments to the processors to and
from which to send and receive data. We note in passing
that the HOLD controller and the FIFO circuit could also
be used to interface two stages in a pipeline if desired.
5.2. Future work


(1) The design of the simple interconnection scheme
assumes that the communication needs are localized;
that is, non-local values required by one loop
repetition are produced by the previous repetition. If
repetition I produces values to be used in
repetition J, J > I+1, then a more generalized
interconnection scheme may be required. This is necessary
to route the data directly to the processors that
need them, without the necessity of having to move
them through a chain of intermediate processors. The
reason is the obvious overhead incurred by all the
processors in passing on data that is not useful to them
to others in the network. However, a generalized
interconnection network is complex and hence
expensive. Clearly a design tradeoff exists here. It is
an interesting problem to investigate the advantages
of using software to manage a part of the
communication and hardware for the rest.


(2) More algorithms should be investigated to examine
their suitability for implementation on shared I/O.
Kung reports a number of algorithms for VLSI
implementation [KUNG79]. Most of the algorithms under the




category of one dimensional linear array¹ algorithms
such as the discrete Fourier Transform could probably be
implemented on shared I/O. These algorithms are
executed in a pipelined manner and, if implemented in
shared I/O, could benefit from some of the features
of shared I/O discussed in this thesis.


¹ In a one-dimensional linear array, processors are
arranged in a linear fashion just as in shared I/O.